Abstract: This paper deals with arbitrarily distributed finite-power input signals observed through an additive Gaussian noise channel. It shows a new formula that connects the input-output mutual information and the minimum mean-square error (MMSE) achievable by optimal estimation of the input given the output. That is, the derivative of the mutual information (nats) with respect to the signal-to-noise ratio (SNR) is equal to half the MMSE, regardless of the input statistics. This relationship holds for both scalar and vector signals, as well as for discrete-time and continuous-time noncausal MMSE estimation.

This fundamental information-theoretic result has an unexpected consequence in continuous-time nonlinear estimation: For any input signal with finite power, the causal filtering MMSE achieved at SNR is equal to the average value of the noncausal smoothing MMSE achieved with a channel whose signal-to-noise ratio is chosen uniformly distributed between 0 and SNR.

Index Terms: Gaussian channel, minimum mean-square error (MMSE), mutual information, nonlinear filtering, optimal estimation, smoothing, Wiener process.



I. Introduction 

This paper is centered around two basic quantities in in- 
formation theory and estimation theory, namely, the mutual 
information between the input and the output of a channel, 
and the minimum mean-square error (MMSE) in estimating 
the input given the output. The key discovery is a relationship 
between the mutual information and MMSE that holds regard- 
less of the input distribution, as long as the input-output pair 
are related through additive Gaussian noise. 

Take for example the simplest scalar real-valued Gaussian channel with an arbitrary and fixed input distribution. Let the signal-to-noise ratio (SNR) of the channel be denoted by snr. Both the input-output mutual information and the MMSE are monotone functions of the SNR, denoted by I(snr) and mmse(snr) respectively. This paper finds that the mutual information in nats and the MMSE satisfy the following relationship regardless of the input statistics:

$$ \frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}} I(\mathrm{snr}) = \frac{1}{2}\,\mathrm{mmse}(\mathrm{snr}). \tag{1} $$

Simple as it is, the identity (1) was unknown before this work. It is trivial that one can compute the value of one monotone function given the value of another (e.g., by simply composing the inverse of the latter function with the former); what is quite surprising here is that the overall transformation (1) not only is strikingly simple but is also independent of the input distribution. In fact, this relationship and its variations hold under arbitrary input signaling and the broadest settings of Gaussian channels, including discrete-time and continuous-time channels, either in scalar or vector versions.

In a wider context, the mutual information and mean-square 
error are at the core of information theory and estimation 
theory respectively. The input-output mutual information is 
an indicator of how much coded information can be pumped 
through a channel reliably given a certain input signaling, 
whereas the MMSE measures how accurately each individual 
input sample can be recovered using the channel output. 
Interestingly, (1) shows the strong relevance of mutual information to estimation and filtering and provides a non-coding operational characterization for mutual information. Thus not only is the significance of an identity like (1) self-evident, but the relationship is intriguing and deserves thorough exposition.

At zero SNR, the right-hand side of (1) is equal to one half of the input variance. In that special case the formula, and in particular the fact that at low SNR the mutual information is insensitive to the input distribution, has been remarked before [1], [2], [3]. Relationships between the local behavior of mutual information at vanishing SNR and the MMSE of the estimation of the output given the input are given in [4].

Formula (1) can be proved using the new "incremental
channel" approach which gauges the decrease in mutual in- 
formation due to an infinitesimally small additional Gaussian 
noise. The change in mutual information can be obtained as the 
input-output mutual information of a derived Gaussian channel 
whose SNR is infinitesimally small, a channel for which the 
mutual information is essentially linear in the estimation error, 
and hence relates the rate of mutual information increase to 
the MMSE. 

Another rationale for the relationship (1) traces to the geometry of Gaussian channels, or, more tangibly, the geometric
properties of the likelihood ratio associated with signal de- 
tection in Gaussian noise. Basic information-theoretic notions 
are firmly associated with the likelihood ratio, and foremost 
the mutual information is expressed as the expectation of the 
log-likelihood ratio of conditional and unconditional measures. 
The likelihood ratio also plays a fundamental role in detection 
and estimation, e.g., in hypothesis testing it is compared to 
a threshold to decide which hypothesis to take. Moreover, 
the likelihood ratio is central in the connection of detection 
and estimation, in either continuous-time [5], [6], [7] or 
discrete-time setting [8]. In fact, Esposito [9] and Hatsell
and Nolte [10] noted simple relationships between conditional
mean estimation and the gradient and Laplacian of the log- 
likelihood ratio respectively, although they did not import 
mutual information into the picture. Indeed, the likelihood 
ratio bridges information measures and basic quantities in 
detection and estimation, and in particular, the estimation 
errors (e.g., [11]). 

In continuous-time signal processing, both the causal (filter- 
ing) MMSE and noncausal (smoothing) MMSE are important 
performance measures. Suppose for now that the input is 
a stationary process with arbitrary but fixed statistics. Let 
cmmse(snr) and mmse(snr) denote the causal and noncausal 
MMSEs respectively as a function of the SNR. This paper 
finds that formula (1) holds literally in this continuous-time setting, i.e., the derivative of the mutual information rate is equal to half the noncausal MMSE. Furthermore, by using this new information-theoretic identity, an unexpected fundamental result in nonlinear filtering is unveiled. That is, the filtering MMSE is equal to the mean value of the smoothing MMSE:

$$ \mathrm{cmmse}(\mathrm{snr}) = \mathsf{E}\{\mathrm{mmse}(\Gamma)\} \tag{2} $$

where $\Gamma$ is chosen uniformly distributed between 0 and snr. In fact, stationarity of the input is not required if the MMSEs are defined as time averages.

Relationships between the causal and noncausal estimation 
errors have been studied for the particular case of linear 
estimation (or Gaussian inputs) in [12], where a bound on the 
loss due to the causality constraint is quantified. Capitalizing
on earlier research on the "estimator-correlator" principle by
Kailath and others (see [13]), Duncan [14], [15], Zakai^1
and Kadota et al. [17] pioneered the investigation of rela- 
tions between the mutual information and causal filtering of 
continuous-time signals observed in white Gaussian noise. 
In particular, Duncan showed that the input-output mutual 
information can be expressed as a time-integral of the causal 
MMSE [15]. Duncan's relationship has proven to be useful in 
many applications in information theory and statistics [17], 
[18], [19], [20]. There are also a number of other works 
in this area, most notably those of Liptser [21] and Mayer- 
Wolf and Zakai [22], where the rate of increase in the mutual 
information between the sample of the input process at the 
current time and the entire past of the output process is 
expressed in the causal estimation error and certain Fisher 
informations. Similar results were also obtained for discrete- 
time models by Bucy [23]. In [24] Shmelev devised a general, 
albeit complicated, procedure to obtain the optimal smoother 
from the optimal filter. 

The new relationship (1) in continuous-time and Duncan's Theorem are proved in this paper using the incremental channel approach with increments in additional noise and additional observation time respectively. Formula (2) connecting filtering and smoothing MMSEs is then proved by comparing (1) to Duncan's Theorem. A non-information-theoretic proof is not yet known for (2).

^1 Duncan's Theorem was independently obtained by Zakai in the more general setting of inputs that may depend causally on the noisy output in a 1969 unpublished Bell Labs Memorandum (see [16, ref. [53]]).



In the discrete-time setting, identity (1) still holds, while the relationship between the mutual information and the causal MMSEs takes a different form: We show that the mutual information is sandwiched between the filtering error and the prediction error.

The remainder of this paper is organized as follows. Section II gives the central result (1) for both scalar and vector channels along with four different proofs and discussion of applications. Section III gives the continuous-time channel counterpart along with the fundamental nonlinear filtering-smoothing relationship (2), and a fifth proof of (1). Discrete-time channels are briefly dealt with in Section IV. Section V studies general random transformations observed in additive Gaussian noise, and offers a glimpse at feedback channels. Section VI gives new representations for entropy, differential entropy, and mutual information for arbitrary distributions.

II. Scalar and Vector Gaussian Channels 

A. The Scalar Channel 

Consider a pair of real-valued random variables (X, Y) related by^2

$$ Y = \sqrt{\mathrm{snr}}\,X + N \tag{3} $$

where snr $\ge 0$ and $N \sim \mathcal{N}(0,1)$ is a standard Gaussian random variable independent of X. Then X and Y can be regarded as the input and output respectively of a single use of a scalar Gaussian channel with a signal-to-noise ratio of snr.^3 The input-output conditional probability density is described by

$$ p_{Y|X;\mathrm{snr}}(y|x;\mathrm{snr}) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(y - \sqrt{\mathrm{snr}}\,x\right)^2\right]. \tag{4} $$



Upon the observation of the output Y, one would like to infer the information-bearing input X. The mutual information between X and Y is:

$$ I(X;Y) = \mathsf{E}\left\{ \log \frac{p_{Y|X;\mathrm{snr}}(Y|X;\mathrm{snr})}{p_{Y;\mathrm{snr}}(Y;\mathrm{snr})} \right\} \tag{5} $$

where $p_{Y;\mathrm{snr}}$ denotes the well-defined marginal probability density function of the output:

$$ p_{Y;\mathrm{snr}}(y;\mathrm{snr}) = \mathsf{E}\left\{ p_{Y|X;\mathrm{snr}}(y|X;\mathrm{snr}) \right\}. \tag{6} $$

The mutual information is clearly a function of snr, which we denote by

$$ I(\mathrm{snr}) = I\left(X; \sqrt{\mathrm{snr}}\,X + N\right). \tag{7} $$

The error of an estimate, f(Y), of the input X based on the observation Y can be measured in mean-square sense:

$$ \mathsf{E}\left\{ (X - f(Y))^2 \right\}. \tag{8} $$



^2 In this paper, random objects are denoted by upper-case letters and their values denoted by lower-case letters. The expectation $\mathsf{E}\{\cdot\}$ is taken over the joint distribution of the random variables within the brackets.

^3 If $\mathsf{E}X^2 = 1$ then snr complies with the usual notion of signal-to-noise power ratio; otherwise snr can be regarded as the gain in the output SNR due to the channel. Results in this paper do not require $\mathsf{E}X^2 = 1$.

It is well-known that the minimum value of (8), referred to as the minimum mean-square error or MMSE, is achieved by the conditional mean estimator:

$$ \hat{X}(Y;\mathrm{snr}) = \mathsf{E}\{X \mid Y;\mathrm{snr}\}. \tag{9} $$

The MMSE is also a function of snr, which is denoted by

$$ \mathrm{mmse}(\mathrm{snr}) = \mathrm{mmse}\left(X \mid \sqrt{\mathrm{snr}}\,X + N\right). \tag{10} $$

To start with, consider the special case when the input 
distribution Px is standard Gaussian. The input-output mutual 
information is then the well-known channel capacity under 
input power constraint [25]: 



$$ I(\mathrm{snr}) = \frac{1}{2}\log(1 + \mathrm{snr}). \tag{11} $$

Meanwhile, the conditional mean estimate of the Gaussian input is merely a scaling of the output:

$$ \hat{X}(Y;\mathrm{snr}) = \frac{\sqrt{\mathrm{snr}}}{1 + \mathrm{snr}}\,Y, \tag{12} $$

and hence the MMSE is:

$$ \mathrm{mmse}(\mathrm{snr}) = \frac{1}{1 + \mathrm{snr}}. \tag{13} $$

An immediate observation is

$$ \frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}} I(\mathrm{snr}) = \frac{1}{2}\,\mathrm{mmse}(\mathrm{snr}) \log e, \tag{14} $$

where the base of logarithm is consistent with the mutual information unit. To avoid numerous log e factors, henceforth we adopt natural logarithms and use nats as the unit of all information measures. It turns out that the relationship (14) holds not only for Gaussian inputs, but for any finite-power input.
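The Gaussian-input case above can be checked numerically. The following is a minimal sketch in nats, assuming the closed forms (11) and (13); the helper names and the finite-difference step are illustrative choices, not part of the paper.

```python
import math


def mutual_info_gaussian(snr: float) -> float:
    """I(snr) = (1/2) ln(1 + snr) in nats, standard Gaussian input, eq. (11)."""
    return 0.5 * math.log1p(snr)


def mmse_gaussian(snr: float) -> float:
    """mmse(snr) = 1 / (1 + snr) for a standard Gaussian input, eq. (13)."""
    return 1.0 / (1.0 + snr)


def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Symmetric finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    for snr in (0.1, 1.0, 10.0):
        lhs = central_difference(mutual_info_gaussian, snr)  # dI/dsnr
        rhs = 0.5 * mmse_gaussian(snr)                       # mmse/2
        print(f"snr={snr:5.1f}  dI/dsnr={lhs:.8f}  mmse/2={rhs:.8f}")
```

The two printed columns should agree up to the finite-difference error, which is the content of (14) when natural logarithms are used.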

Theorem 1: Let N be standard Gaussian, independent of X. For every input distribution $P_X$ that satisfies $\mathsf{E}X^2 < \infty$,

$$ \frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}} I\left(X; \sqrt{\mathrm{snr}}\,X + N\right) = \frac{1}{2}\,\mathrm{mmse}\left(X \mid \sqrt{\mathrm{snr}}\,X + N\right). \tag{15} $$

Proof: See Section II-C.

The identity (15) reveals an intimate and intriguing connection between Shannon's mutual information and optimal estimation in the Gaussian channel (3), namely, the rate of the mutual information increase as the SNR increases is equal to half the MMSE achieved by the optimal (in general nonlinear) estimator.

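In addition to the special case of Gaussian inputs, Theorem 1 can also be verified for another simple and important input signaling: ±1 with equal probability. The conditional mean estimate for such an input is given by

$$ \hat{X}(Y;\mathrm{snr}) = \tanh\left(\sqrt{\mathrm{snr}}\,Y\right). \tag{16} $$

The MMSE and the mutual information are obtained as:

$$ \mathrm{mmse}(\mathrm{snr}) = 1 - \int_{-\infty}^{\infty} \frac{e^{-y^2/2}}{\sqrt{2\pi}} \tanh\left(\mathrm{snr} - \sqrt{\mathrm{snr}}\,y\right) \mathrm{d}y, \tag{17} $$

and (e.g., [26, p. 274] and [27, Problem 4.22])

$$ I(\mathrm{snr}) = \mathrm{snr} - \int_{-\infty}^{\infty} \frac{e^{-y^2/2}}{\sqrt{2\pi}} \log\cosh\left(\mathrm{snr} - \sqrt{\mathrm{snr}}\,y\right) \mathrm{d}y \tag{18} $$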
respectively. Appendix I verifies that (17) and (18) satisfy (15).

For illustration purposes, the MMSE and the mutual information are plotted against the SNR in Figure 1 for Gaussian and equiprobable binary inputs.

Fig. 1. The mutual information (in nats) and MMSE of scalar Gaussian channel with Gaussian and equiprobable binary inputs, respectively.
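For the equiprobable ±1 input, Theorem 1 can also be checked by direct numerical integration. The sketch below assumes the standard expressions for this input, mmse(snr) = 1 - E{tanh(snr - sqrt(snr) N)} and I(snr) = snr - E{log cosh(snr - sqrt(snr) N)} with N standard Gaussian, as in (17) and (18). The trapezoidal rule, the integration range, and all function names are illustrative assumptions.

```python
import math


def gauss_expectation(g, lo: float = -12.0, hi: float = 12.0, n: int = 4000) -> float:
    """Trapezoidal approximation of E{g(N)} for N ~ N(0, 1)."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        y = lo + k * h
        weight = 0.5 if k in (0, n) else 1.0
        density = math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)
        total += weight * density * g(y)
    return total * h


def log_cosh(x: float) -> float:
    """Numerically stable log(cosh(x))."""
    a = abs(x)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


def mmse_binary(snr: float) -> float:
    """MMSE for an equiprobable +/-1 input, eq. (17)."""
    r = math.sqrt(snr)
    return 1.0 - gauss_expectation(lambda y: math.tanh(snr - r * y))


def mutual_info_binary(snr: float) -> float:
    """Mutual information in nats for an equiprobable +/-1 input, eq. (18)."""
    r = math.sqrt(snr)
    return snr - gauss_expectation(lambda y: log_cosh(snr - r * y))


if __name__ == "__main__":
    step = 1e-4
    for snr in (0.5, 1.0, 4.0):
        slope = (mutual_info_binary(snr + step) - mutual_info_binary(snr - step)) / (2 * step)
        print(f"snr={snr:4.1f}  dI/dsnr={slope:.6f}  mmse/2={0.5 * mmse_binary(snr):.6f}")
```

The finite-difference slope of the mutual information should match half the MMSE at each SNR, in line with (15).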

B. The Vector Channel 

Multiple-input multiple-output (MIMO) systems are frequently described by the vector Gaussian channel:

$$ Y = \sqrt{\mathrm{snr}}\,H X + N \tag{19} $$

where H is a deterministic L × K matrix and the noise N consists of independent standard Gaussian entries. The input X (with distribution $P_X$) and the output Y are column vectors of appropriate dimensions.

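The input and output are related by a Gaussian conditional probability density:

$$ p_{Y|X;\mathrm{snr}}(y|x;\mathrm{snr}) = (2\pi)^{-L/2} \exp\left[-\frac{1}{2}\left\|y - \sqrt{\mathrm{snr}}\,H x\right\|^2\right], \tag{20} $$

where $\|\cdot\|$ denotes the Euclidean norm of a vector. The MMSE in estimating HX is

$$ \mathrm{mmse}(\mathrm{snr}) = \mathsf{E}\left\{ \left\| H X - H \hat{X}(Y;\mathrm{snr}) \right\|^2 \right\}, \tag{21} $$

where $\hat{X}(Y;\mathrm{snr})$ is the conditional mean estimate. A generalization of Theorem 1 is the following: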

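Theorem 2: Let N be a vector with independent standard Gaussian components, independent of X. For every $P_X$ satisfying $\mathsf{E}\|X\|^2 < \infty$,

$$ \frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}} I\left(X; \sqrt{\mathrm{snr}}\,H X + N\right) = \frac{1}{2}\,\mathrm{mmse}(\mathrm{snr}). \tag{22} $$

Proof: See Section II-E.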



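A verification of (22) in the special case of Gaussian input with positive definite covariance matrix $\Sigma$ is straightforward. The covariance of the conditional mean estimation error is

$$ \mathsf{E}\left\{ (X - \hat{X})(X - \hat{X})^{\top} \right\} = \left( \Sigma^{-1} + \mathrm{snr}\,H^{\top}H \right)^{-1}, \tag{23} $$

from which one can calculate:

$$ \mathsf{E}\left\{ \left\| H(X - \hat{X}) \right\|^2 \right\} = \mathrm{tr}\left\{ H \left( \Sigma^{-1} + \mathrm{snr}\,H^{\top}H \right)^{-1} H^{\top} \right\}. \tag{24} $$

The mutual information is [28]:

$$ I(X;Y) = \frac{1}{2} \log\det\left( I + \mathrm{snr}\,\Sigma^{1/2} H^{\top}H\, \Sigma^{1/2} \right), \tag{25} $$

where $\Sigma^{1/2}$ is the unique positive semi-definite symmetric matrix such that $(\Sigma^{1/2})^2 = \Sigma$. Clearly,

$$ \frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}} I(X;Y) = \frac{1}{2}\,\mathrm{tr}\left\{ \left( I + \mathrm{snr}\,\Sigma^{1/2} H^{\top}H\, \Sigma^{1/2} \right)^{-1} \Sigma^{1/2} H^{\top}H\, \Sigma^{1/2} \right\} \tag{26} $$

$$ = \frac{1}{2}\,\mathsf{E}\left\{ \left\| H(X - \hat{X}) \right\|^2 \right\}. \tag{27} $$

Fig. 2. An SNR-incremental Gaussian channel.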
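The Gaussian-input verification of (22) can be reproduced on a small example. This is a sketch for a 2 × 2 case only: the matrices H and SIGMA are hypothetical values chosen for illustration, and the determinant in (25) is evaluated through the equivalent form det(I + snr Σ HᵀH), which avoids computing the matrix square root.

```python
import math


def matmul(a, b):
    """Product of two small dense matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def det2(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inv2(a):
    d = det2(a)
    return [[a[1][1] / d, -a[0][1] / d], [-a[1][0] / d, a[0][0] / d]]


def add_scaled(a, b, c):
    """Return a + c * b for 2 x 2 matrices."""
    return [[a[i][j] + c * b[i][j] for j in range(2)] for i in range(2)]


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
H = [[1.0, 0.5], [-0.3, 2.0]]      # hypothetical channel matrix
SIGMA = [[2.0, 0.6], [0.6, 1.0]]   # hypothetical positive definite input covariance
GRAM = matmul(transpose(H), H)     # H^T H


def mutual_info(snr: float) -> float:
    """Eq. (25), using det(I + snr S^{1/2} G S^{1/2}) = det(I + snr S G)."""
    return 0.5 * math.log(det2(add_scaled(IDENTITY, matmul(SIGMA, GRAM), snr)))


def mmse(snr: float) -> float:
    """Eq. (24): trace of H (Sigma^{-1} + snr H^T H)^{-1} H^T."""
    err_cov = inv2(add_scaled(inv2(SIGMA), GRAM, snr))   # eq. (23)
    m = matmul(matmul(H, err_cov), transpose(H))
    return m[0][0] + m[1][1]


if __name__ == "__main__":
    snr, step = 1.5, 1e-5
    slope = (mutual_info(snr + step) - mutual_info(snr - step)) / (2 * step)
    print(f"dI/dsnr={slope:.8f}  mmse/2={0.5 * mmse(snr):.8f}")
```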



C. Incremental Channels 

The central relationship given in Sections II-A and II-B can be proved in various, rather different, ways. The most enlightening proof is by considering what we call an incremental channel. A proof of Theorem 1 using the SNR-incremental channel is given next, while its generalization to the vector version is omitted but straightforward. Alternative proofs are relegated to later sections.

The key to the incremental-channel approach is to reduce 
the proof of the relationship for all SNRs to that for the special 
case of vanishing SNR, a domain in which we can capitalize 
on the following result: 

Lemma 1: As $\delta \to 0$, the input-output mutual information of the canonical Gaussian channel:

$$ Y = \sqrt{\delta}\,Z + U, \tag{28} $$

where $\mathsf{E}Z^2 < \infty$ and $U \sim \mathcal{N}(0,1)$ is independent of Z, is given by

$$ I(Y;Z) = \frac{\delta}{2}\,\mathsf{E}(Z - \mathsf{E}Z)^2 + o(\delta). \tag{29} $$

Essentially, Lemma 1 states that the mutual information is half the SNR times the variance of the input at the vicinity of zero SNR, but insensitive to the shape of the input distribution otherwise. Lemma 1 has been given in [2, Lemma 5.2.1] and [3, Theorem 4] (also implicitly in [1]).^4 Lemma 1 is the special case of Theorem 1 at vanishing SNR, which, by means of the incremental-channel method, can be bootstrapped to a proof of Theorem 1 for all SNRs.

^4 A proof of Lemma 1 is given in Appendix II for completeness.



Proof: [Theorem 1] Fix arbitrary snr > 0 and $\delta > 0$. Consider a cascade of two Gaussian channels as depicted in Figure 2:

$$ Y_1 = X + \sigma_1 N_1, \tag{30a} $$
$$ Y_2 = Y_1 + \sigma_2 N_2, \tag{30b} $$

where X is the input, and $N_1$ and $N_2$ are independent standard Gaussian random variables. Let $\sigma_1, \sigma_2 > 0$ satisfy:

$$ \sigma_1^2 = \frac{1}{\mathrm{snr} + \delta}, \tag{31a} $$
$$ \sigma_1^2 + \sigma_2^2 = \frac{1}{\mathrm{snr}}, \tag{31b} $$

so that the signal-to-noise ratio of the first channel (30a) is snr + $\delta$ and that of the composite channel is snr. Such a system is referred to as an SNR-incremental channel since the SNR increases by $\delta$ from $Y_2$ to $Y_1$.

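Theorem 1 is equivalent to that, as $\delta \to 0$,

$$ I(X;Y_1) - I(X;Y_2) = I(\mathrm{snr} + \delta) - I(\mathrm{snr}) \tag{32} $$
$$ = \frac{\delta}{2}\,\mathrm{mmse}(\mathrm{snr}) + o(\delta). \tag{33} $$

Noting that $X - Y_1 - Y_2$ is a Markov chain,

$$ I(X;Y_1) - I(X;Y_2) = I(X;Y_1,Y_2) - I(X;Y_2) \tag{34} $$
$$ = I(X;Y_1 \mid Y_2), \tag{35} $$

where (35) is the mutual information chain rule [29]. A linear combination of (30a) and (30b) yields

$$ (\mathrm{snr} + \delta)\,Y_1 = \mathrm{snr}\,(Y_2 - \sigma_2 N_2) + \delta\,(X + \sigma_1 N_1) \tag{36} $$
$$ = \mathrm{snr}\,Y_2 + \delta\,X + \sqrt{\delta}\,N \tag{37} $$

where we have defined

$$ N = \frac{1}{\sqrt{\delta}} \left( \delta\,\sigma_1 N_1 - \mathrm{snr}\,\sigma_2 N_2 \right). \tag{38} $$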



Clearly, the incremental channel (30) is equivalent to (37) paired with (30b). Due to (31) and mutual independence of $(X, N_1, N_2)$, N is a standard Gaussian random variable independent of X. Moreover, $(X, N, \sigma_1 N_1 + \sigma_2 N_2)$ are mutually independent since

$$ \mathsf{E}\left\{ N \left( \sigma_1 N_1 + \sigma_2 N_2 \right) \right\} = 0, \tag{39} $$

also due to (31). Therefore N is independent of $(X, Y_2)$ by (30). From (37), it is clear that

$$ I(X;Y_1 \mid Y_2 = y_2) = I\left( X; \mathrm{snr}\,Y_2 + \delta\,X + \sqrt{\delta}\,N \mid Y_2 = y_2 \right) \tag{40} $$
$$ = I\left( X; \sqrt{\delta}\,X + N \mid Y_2 = y_2 \right). \tag{41} $$

Hence given $Y_2 = y_2$, (37) is equivalent to a Gaussian channel with SNR equal to $\delta$ where the input distribution is $P_{X|Y_2 = y_2}$. Applying Lemma 1 to such a channel conditioned on $Y_2 = y_2$, one obtains

$$ I(X;Y_1 \mid Y_2 = y_2) = \frac{\delta}{2}\,\mathsf{E}\left\{ \left( X - \mathsf{E}\{X \mid Y_2 = y_2\} \right)^2 \mid Y_2 = y_2 \right\} + o(\delta). \tag{42} $$
Taking the expectation over $Y_2$ on both sides of (42) yields

$$ I(X;Y_1 \mid Y_2) = \frac{\delta}{2}\,\mathsf{E}\left\{ \left( X - \mathsf{E}\{X \mid Y_2\} \right)^2 \right\} + o(\delta), \tag{43} $$

which establishes (33) by (35) together with the fact that

$$ \mathsf{E}\left\{ \left( X - \mathsf{E}\{X \mid Y_2\} \right)^2 \right\} = \mathrm{mmse}(\mathrm{snr}). \tag{44} $$

Hence the proof of Theorem 1. ∎

Fig. 3. A Gaussian pipe where noise is added gradually.
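The algebra behind the SNR-incremental channel can be sanity-checked with a few lines of arithmetic. The sketch below assumes the noise variances fixed by (31a) and (31b) and the definition of N in (38). It computes the variance of N and its correlation with the total noise in (39); the function name is a hypothetical helper introduced for this illustration.

```python
import math


def incremental_channel_check(snr: float, delta: float):
    """Return (Var{N}, E{N (sigma_1 N_1 + sigma_2 N_2)}) for the construction in (30)-(38)."""
    var1 = 1.0 / (snr + delta)     # sigma_1^2, eq. (31a)
    var2 = 1.0 / snr - var1        # sigma_2^2, from eq. (31b)
    # N = (delta * sigma_1 * N_1 - snr * sigma_2 * N_2) / sqrt(delta), eq. (38),
    # with N_1 and N_2 independent and of unit variance.
    var_n = (delta ** 2 * var1 + snr ** 2 * var2) / delta
    # Cross-correlation with the composite noise sigma_1 N_1 + sigma_2 N_2, eq. (39).
    cross = (delta * var1 - snr * var2) / math.sqrt(delta)
    return var_n, cross


if __name__ == "__main__":
    for snr, delta in ((2.0, 0.01), (0.5, 0.2), (10.0, 1.0)):
        var_n, cross = incremental_channel_check(snr, delta)
        print(f"snr={snr}, delta={delta}: Var(N)={var_n:.12f}, cross={cross:.2e}")
```

The variance should come out as one and the cross term as zero (up to rounding) for any positive snr and delta, which is why N is standard Gaussian and independent of the composite noise.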
Underlying the incremental-channel proof of Theorem 1 is the chain rule for information:

$$ I(X; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X; Y_i \mid Y_{i+1}, \ldots, Y_n). \tag{45} $$

When $X - Y_1 - \cdots - Y_n$ is a Markov chain, (45) becomes

$$ I(X;Y_1) = \sum_{i=1}^{n} I(X; Y_i \mid Y_{i+1}), \tag{46} $$

where we let $Y_{n+1} \equiv 0$. This applies to a train of outputs tapped from a Gaussian pipe where noise is added gradually until the SNR vanishes as depicted in Figure 3. The sum in (46) converges to an integral as $\{Y_i\}$ becomes a finer and finer sequence of Gaussian channel outputs. To see this note from (43) that each conditional mutual information in (46) corresponds to a low-SNR channel and is essentially proportional to the MMSE times the SNR increment. This viewpoint leads us to an equivalent form of Theorem 1:

$$ I(\mathrm{snr}) = \frac{1}{2} \int_{0}^{\mathrm{snr}} \mathrm{mmse}(\gamma)\,\mathrm{d}\gamma. \tag{47} $$



Therefore, as is illustrated by the curves in Figure 1, the mutual
information is equal to an accumulation of the MMSE as 
a function of the SNR due to the fact that an infinitesimal 
increase in the SNR adds to the total mutual information an 
increase proportional to the MMSE. 
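The integral form (47) lends itself to a direct numerical illustration. The sketch below assumes a zero-mean Gaussian input of variance `power` (a hypothetical parameter), for which mmse(γ) = power / (1 + γ·power) and I(snr) = ½ ln(1 + snr·power) in nats, and accumulates the MMSE with a plain trapezoidal rule.

```python
import math


def mmse_gaussian(gamma: float, power: float = 1.0) -> float:
    """MMSE of a Gaussian input with variance `power` at SNR gamma."""
    return power / (1.0 + gamma * power)


def mutual_info_from_mmse(snr: float, mmse, steps: int = 20000) -> float:
    """Evaluate (1/2) * integral_0^snr mmse(gamma) d gamma by the trapezoidal rule, eq. (47)."""
    h = snr / steps
    total = 0.5 * (mmse(0.0) + mmse(snr))
    for k in range(1, steps):
        total += mmse(k * h)
    return 0.5 * total * h


if __name__ == "__main__":
    for snr in (0.5, 4.0, 20.0):
        accumulated = mutual_info_from_mmse(snr, mmse_gaussian)
        closed_form = 0.5 * math.log1p(snr)
        print(f"snr={snr:5.1f}  integral={accumulated:.8f}  closed form={closed_form:.8f}")
```

Any other `mmse` callable can be passed in its place, for example a numerically evaluated MMSE of a non-Gaussian input, to obtain the corresponding mutual information.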

The infinite divisibility of Gaussian distributions, namely, 
the fact that a Gaussian random variable can always be 
decomposed as the sum of independent Gaussian random 
variables of smaller variances, is crucial in establishing the 
incremental channel (or, the Markov chain). This property 
enables us to study the mutual information increase due to 
an infinitesimal increase in the SNR, and thus obtain the 
differential equations (15) and (22) in Theorems 1 and 2.

The following corollaries are immediate from Theorem 1
together with the fact that mmse(snr) is monotone decreasing.

Corollary 1: The mutual information I(snr) is a concave function in snr.

Corollary 2: The mutual information can be bounded as

$$ \mathsf{E}\{\mathrm{var}\{X \mid Y;\mathrm{snr}\}\} = \mathrm{mmse}(\mathrm{snr}) \tag{48} $$
$$ \le \frac{2}{\mathrm{snr}}\,I(\mathrm{snr}) \tag{49} $$
$$ \le \mathrm{mmse}(0) = \mathrm{var}\{X\}. \tag{50} $$

D. Applications and Discussions

1) Some Applications of Theorems 1 and 2: The newly discovered relationship between the mutual information and MMSE finds one of its first uses in relating CDMA channel spectral efficiencies (mutual information per dimension) under joint and separate decoding in the large-system limit [30], [31]. Under an arbitrary finite-power input distribution, Theorem 1 is invoked in [30] to show that the spectral efficiency under joint decoding is equal to the integral of the spectral efficiency under separate decoding as a function of the system load. The practical lesson therein is the optimality in the large-system limit of successive single-user decoding with cancellation of interference from already decoded users, and an individually optimal detection front end against yet undecoded users. This is a generalization to arbitrary input signaling of previous results that successive cancellation with a linear MMSE front end achieves the CDMA channel capacity under Gaussian inputs [32], [33], [34], [35].

Relationships between information theory and estimation theory have been identified occasionally, yielding results in one area taking advantage of known results from the other. This is exemplified by the classical capacity-rate distortion relations, which have been used to develop lower bounds on estimation errors [36]. The fact that the mutual information and the MMSE determine each other by a simple formula also provides a new means to calculate or bound one quantity using the other. An upper (resp. lower) bound for the mutual information is immediate by bounding the MMSE for all SNRs using a suboptimal (resp. genie aided) estimator. Lower bounds on the MMSE, e.g., [37], lead to new lower bounds on the mutual information.

An important example of such relationships is the case of Gaussian inputs. Under power constraint, Gaussian inputs are most favorable for Gaussian channels in information-theoretic sense (they maximize the mutual information); on the other hand they are least favorable in estimation-theoretic sense (they maximize the MMSE). These well-known results are seen to be immediately equivalent through Theorem 1 (or Theorem 2 for the vector case). This also points to a simple proof of the result that Gaussian inputs achieve capacity by observing that the linear estimation upper bound for MMSE is achieved for Gaussian inputs.^5

^5 The observations here are also relevant to continuous-time Gaussian channels (see Section III).

Another application of the new results is in the analysis of sparse-graph codes, where [38] has recently shown that the so-called generalized extrinsic information transfer (GEXIT) function plays a fundamental role. This function is defined for arbitrary codes and channels as minus the derivative of the input-output mutual information per symbol with respect to a channel quality parameter when the input is equiprobable on the codebook. According to Theorem 2, in the special case of the Gaussian channel the GEXIT function is equal to minus one half of the average MMSE of individual input symbols given the channel outputs. Moreover, [38] shows that (1) leads to a simple interpretation of the "area property" for Gaussian channels (cf. [39]). Inspired by Theorem 1, [40] also advocated using the mean-square error as the EXIT function for Gaussian channels.

As another application, the central theorems also provide an
intuitive proof of de Bruijn's identity as is shown next.

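2) De Bruijn's Identity: An interesting insight is that Theorem 2 is equivalent to the (multivariate) de Bruijn identity [41], [42]:

$$ \frac{\mathrm{d}}{\mathrm{d}t}\,h\left( H X + \sqrt{t}\,N \right) = \frac{1}{2}\,\mathrm{tr}\left\{ J\left( H X + \sqrt{t}\,N \right) \right\} \tag{51} $$

where N is a vector with independent standard Gaussian entries, independent of X. Here, $h(\cdot)$ stands for the differential entropy and $J(\cdot)$ for Fisher's information matrix [43], which is defined as^6

$$ J(Y) = \mathsf{E}\left\{ \left[ \nabla \log p_Y(Y) \right] \left[ \nabla \log p_Y(Y) \right]^{\top} \right\}. \tag{52} $$

Let snr = 1/t and $Y = \sqrt{\mathrm{snr}}\,H X + N$. Then

$$ h\left( H X + \sqrt{t}\,N \right) = I(X;Y) - \frac{L}{2} \log \frac{\mathrm{snr}}{2\pi e}. \tag{53} $$

Meanwhile,

$$ J\left( H X + \sqrt{t}\,N \right) = \mathrm{snr}\,J(Y). \tag{54} $$

Note that

$$ p_{Y;\mathrm{snr}}(y;\mathrm{snr}) = \mathsf{E}\left\{ p_{Y|X;\mathrm{snr}}(y|X;\mathrm{snr}) \right\}, \tag{55} $$

where $p_{Y|X;\mathrm{snr}}(y|x;\mathrm{snr})$ is a Gaussian density (20). It can be shown that

$$ \nabla \log p_{Y;\mathrm{snr}}(y;\mathrm{snr}) = \sqrt{\mathrm{snr}}\,H \hat{X}(y;\mathrm{snr}) - y. \tag{56} $$

Plugging (56) into (52) and (54) gives

$$ J(Y) = I - \mathrm{snr}\,H\,\mathsf{E}\left\{ (X - \hat{X})(X - \hat{X})^{\top} \right\} H^{\top}. \tag{57} $$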



Now de Bruijn's identity (51) and Theorem 2 prove each other by (53) and (57). Noting this equivalence, the incremental-channel approach offers an intuitive alternative to the conventional technical proof of de Bruijn's identity obtained by integrating by parts (e.g., [29]). Although equivalent to de Bruijn's identity, Theorem 2 is important since mutual information and MMSE are more canonical operational measures than differential entropy and Fisher's information.

The Cramér-Rao bound states that the inverse of Fisher's information is a lower bound on estimation accuracy. The bound is tight for Gaussian channels, where Fisher's information matrix and the covariance of conditional mean estimation error determine each other by (57). In particular, for a scalar channel,

$$ J\left( \sqrt{\mathrm{snr}}\,X + N \right) = 1 - \mathrm{snr} \cdot \mathrm{mmse}(\mathrm{snr}). \tag{58} $$
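As a quick sanity check of (58), consider the standard Gaussian input of Section II-A. This worked case is an illustration under that assumption only, not a step of the general argument. The output is then Gaussian with variance 1 + snr, and the Fisher information of a Gaussian density is the reciprocal of its variance:

```latex
\begin{align*}
Y = \sqrt{\mathrm{snr}}\,X + N \sim \mathcal{N}(0,\, 1 + \mathrm{snr})
  \quad &\Longrightarrow \quad J(Y) = \frac{1}{1 + \mathrm{snr}}, \\
1 - \mathrm{snr}\cdot\mathrm{mmse}(\mathrm{snr})
  &= 1 - \frac{\mathrm{snr}}{1 + \mathrm{snr}} = \frac{1}{1 + \mathrm{snr}}.
\end{align*}
```

Both sides agree, using mmse(snr) = 1/(1 + snr) from (13).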



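^6 The gradient operator can be written as $\nabla = \left[ \frac{\partial}{\partial y_1}, \ldots, \frac{\partial}{\partial y_L} \right]^{\top}$ symbolically. For any differentiable function $f : \mathbb{R}^L \to \mathbb{R}$, its gradient at any y is a column vector $\nabla f(y) = \left[ \frac{\partial f}{\partial y_1}(y), \ldots, \frac{\partial f}{\partial y_L}(y) \right]^{\top}$.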



3) Derivative of the Divergence: Consider an input-output pair (X, Y) connected through (3). The mutual information I(X;Y) is the average value over the input X of the divergence $D\left( P_{Y|X=x} \,\|\, P_Y \right)$. Refining Theorem 1, it is possible to directly obtain the derivative of the divergence given any value of the input:

Theorem 3: For every input distribution $P_X$ that satisfies $\mathsf{E}X^2 < \infty$,

$$ \frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}} D\left( P_{Y|X=x} \,\|\, P_Y \right) = \frac{1}{2}\,\mathsf{E}\left\{ |X - X'|^2 \mid X = x \right\} - \frac{1}{2\sqrt{\mathrm{snr}}}\,\mathsf{E}\left\{ X' N \mid X = x \right\}, \tag{59} $$

where X' is an auxiliary random variable which is independent identically distributed (i.i.d.) with X conditioned on $Y = \sqrt{\mathrm{snr}}\,X + N$.

The auxiliary random variable X' has an interesting physical meaning. It can be regarded as the output of the "retrochannel" [30], [31], which takes Y as the input and generates a random variable according to the posterior probability distribution $P_{X|Y;\mathrm{snr}}$. The joint distribution of (X, Y, X') is unique although the choice of X' is not.

4) Multiuser Channel: A multiuser system in which users may transmit at different SNRs can be modelled by:

$$ Y = H \Gamma X + N \tag{60} $$

where H is a deterministic L × K matrix known to the receiver, $\Gamma = \mathrm{diag}\{\sqrt{\mathrm{snr}_1}, \ldots, \sqrt{\mathrm{snr}_K}\}$ consists of the square roots of the SNRs of the K users, and N consists of independent standard Gaussian entries. The following theorem addresses the derivative of the total mutual information with respect to an individual user's SNR.

Theorem 4: For every input distribution $P_X$ that satisfies $\mathsf{E}\|X\|^2 < \infty$,

$$ \frac{\partial}{\partial\,\mathrm{snr}_k} I(X;Y) = \frac{1}{2} \sum_{i=1}^{K} \sqrt{\frac{\mathrm{snr}_i}{\mathrm{snr}_k}}\, \left[ H^{\top}H \right]_{ki} \mathsf{E}\left\{ \mathrm{Cov}\{X_k, X_i \mid Y; \Gamma\} \right\}, \tag{61} $$

where $\mathrm{Cov}\{\cdot,\cdot \mid \cdot\}$ denotes conditional covariance.

The proof of Theorem 4 follows that of Theorem 2 in Appendix IV and is omitted. Theorems 1 and 2 can be recovered from Theorem 4 by setting $\mathrm{snr}_k = \mathrm{snr}$ for all k.
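Formula (61) can be checked numerically in the one case where everything is available in closed form. The sketch below assumes K = 2 users with independent standard Gaussian inputs, so that I(X;Y) = ½ log det(I + Γ HᵀH Γ) and the error covariance is (I + Γ HᵀH Γ)⁻¹. The matrix H and all function names are hypothetical values chosen for illustration.

```python
import math

H = [[1.0, 0.4], [0.2, -1.5], [0.7, 0.3]]   # hypothetical 3 x 2 channel matrix
# Gram matrix G = H^T H (2 x 2)
G = [[sum(H[l][i] * H[l][j] for l in range(3)) for j in range(2)] for i in range(2)]


def system_matrix(snrs):
    """M = I + Gamma G Gamma with Gamma = diag(sqrt(snr_1), sqrt(snr_2))."""
    r = [math.sqrt(s) for s in snrs]
    return [[(1.0 if i == j else 0.0) + r[i] * G[i][j] * r[j] for j in range(2)]
            for i in range(2)]


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def mutual_info(snrs) -> float:
    """I(X;Y) in nats for independent standard Gaussian inputs."""
    return 0.5 * math.log(det2(system_matrix(snrs)))


def error_covariance(snrs):
    """Covariance of X given Y, equal to M^{-1} for Gaussian inputs."""
    m = system_matrix(snrs)
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def theorem4_rhs(snrs, k: int) -> float:
    """Right-hand side of eq. (61) for user k (0-based index)."""
    e = error_covariance(snrs)
    return 0.5 * sum(math.sqrt(snrs[i] / snrs[k]) * G[k][i] * e[k][i] for i in range(2))


def partial_derivative(snrs, k: int, h: float = 1e-6) -> float:
    """Central finite difference of I(X;Y) with respect to snr_k."""
    up, down = list(snrs), list(snrs)
    up[k] += h
    down[k] -= h
    return (mutual_info(up) - mutual_info(down)) / (2.0 * h)


if __name__ == "__main__":
    snrs = (2.0, 0.5)
    for k in range(2):
        print(f"user {k}: finite difference={partial_derivative(snrs, k):.8f}  "
              f"formula (61)={theorem4_rhs(snrs, k):.8f}")
```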

E. Alternative Proofs of Theorems 1 and 2

In this subsection, we give an alternative proof of Theorem 2, which is based on the geometric properties of the likelihood ratio between the output distribution and the noise distribution. This proof is a distilled version of the more general result of Zakai [44] (follow-up to this work) that uses the Malliavin calculus and shows that the central relationship between the mutual information and estimation error holds also in the abstract Wiener space. This alternative approach of Zakai makes use of relationships between conditional mean estimation and likelihood ratios due to Esposito [9] and Hatsell and Nolte [10].

As mentioned earlier, the central theorems also admit several other proofs. In fact, a third proof using the de Bruijn identity is already evident in Section II-D. A fourth proof of Theorems 1 and 2 by taking the derivative of the mutual information is given in Appendices III and IV. A fifth proof taking advantage of results in the continuous-time domain is relegated to Section III.

It suffices to prove Theorem 2 assuming H to be the identity matrix since one can always regard HX as the input. Let $Z = \sqrt{\mathrm{snr}}\,X$. Then the channel (19) is represented by the canonical L-dimensional Gaussian channel:

$$ Y = Z + N. \tag{62} $$

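The mutual information, which is a conditional divergence, admits the following decomposition [1]:

$$ I(Y;Z) = D\left( P_{Y|Z} \,\|\, P_Y \mid P_Z \right) \tag{63} $$
$$ = D\left( P_{Y|Z} \,\|\, P_{Y'} \mid P_Z \right) - D\left( P_Y \,\|\, P_{Y'} \right) \tag{64} $$

where $P_{Y'}$ is an arbitrary distribution as long as the two divergences on the right-hand side of (64) are well-defined. Choose Y' = N. Then the mutual information can be expressed in terms of the divergence between the unconditional output distribution and the noise distribution:

$$ I(Y;Z) = \frac{1}{2}\,\mathsf{E}\|Z\|^2 - D\left( P_Y \,\|\, P_N \right). \tag{65} $$

Hence Theorem 2 is equivalent to the following:

Theorem 5: For every $P_X$ satisfying $\mathsf{E}\|X\|^2 < \infty$,

$$ \frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}} D\left( P_{\sqrt{\mathrm{snr}}\,X + N} \,\|\, P_N \right) = \frac{1}{2}\,\mathsf{E}\left\{ \left\| \mathsf{E}\left\{ X \mid \sqrt{\mathrm{snr}}\,X + N \right\} \right\|^2 \right\}. \tag{66} $$

Theorem 5 can be proved using geometric properties of the likelihood ratio

$$ l(y) = \frac{p_Y(y)}{p_N(y)}. \tag{67} $$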
The following lemmas are important steps.

Lemma 2 (Esposito [9]): The gradient of the log-likelihood ratio gives the conditional mean estimate:

$$ \nabla \log l(y) = \mathsf{E}\{Z \mid Y = y\}. \tag{68} $$

Lemma 3 (Hatsell and Nolte [10]): The log-likelihood ratio satisfies Poisson's equation:^7

$$ \nabla^2 \log l(y) = \mathsf{E}\left\{ \|Z\|^2 \mid Y = y \right\} - \left\| \mathsf{E}\{Z \mid Y = y\} \right\|^2. \tag{69} $$

From Lemmas 2 and 3,

$$ \mathsf{E}\left\{ \|Z\|^2 \mid Y = y \right\} = \nabla^2 \log l(y) + \left\| \nabla \log l(y) \right\|^2. \tag{70} $$

The following result is immediate.

Lemma 4:

$$ \mathsf{E}\left\{ \|Z\|^2 \mid Y = y \right\} = l^{-1}(y)\,\nabla^2 l(y). \tag{71} $$

A proof of Theorem 5 is obtained by taking the derivative directly.
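Lemmas 2 to 4 can be verified by hand in a simple scalar case. The worked example below is an illustration added under the assumption that Z takes the values ±a with equal probability (with a = √snr this is the binary input of Section II-A); the symbol a is introduced only for this example.

```latex
\begin{align*}
l(y) &= \mathsf{E}\left\{ \exp\left( yZ - \tfrac{1}{2}Z^2 \right) \right\}
      = e^{-a^2/2} \cosh(ay), \\
\frac{\mathrm{d}}{\mathrm{d}y} \log l(y) &= a \tanh(ay) = \mathsf{E}\{Z \mid Y = y\}
      && \text{(Lemma 2)}, \\
\frac{\mathrm{d}^2}{\mathrm{d}y^2} \log l(y) &= a^2 \left( 1 - \tanh^2(ay) \right)
      = \mathsf{E}\{Z^2 \mid Y = y\} - \left( \mathsf{E}\{Z \mid Y = y\} \right)^2
      && \text{(Lemma 3)}, \\
\frac{1}{l(y)} \frac{\mathrm{d}^2 l(y)}{\mathrm{d}y^2} &= a^2 = \mathsf{E}\{Z^2 \mid Y = y\}
      && \text{(Lemma 4)}.
\end{align*}
```

The second line recovers the conditional mean estimate (16) up to the scaling Z = √snr X, and the last line is consistent with Z² = a² holding with probability one.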



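^7 For any differentiable $f : \mathbb{R}^L \to \mathbb{R}^L$, $\nabla \cdot f = \sum_{l=1}^{L} \frac{\partial f_l}{\partial y_l}$. If f is doubly differentiable, its Laplacian is defined as $\nabla^2 f = \nabla \cdot (\nabla f)$.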



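Proof: [Theorem 5] Note that the likelihood ratio can be expressed as

$$ l(y) = \frac{\mathsf{E}\left\{ p_{Y|X;\mathrm{snr}}(y|X;\mathrm{snr}) \right\}}{p_N(y)} \tag{72} $$
$$ = \mathsf{E}\left\{ \exp\left[ \sqrt{\mathrm{snr}}\,y^{\top}X - \frac{\mathrm{snr}}{2}\|X\|^2 \right] \right\}. \tag{73} $$

Also, for any function $f(\cdot)$,

$$ \mathsf{E}\left\{ f(X) \exp\left[ \sqrt{\mathrm{snr}}\,y^{\top}X - \frac{\mathrm{snr}}{2}\|X\|^2 \right] \right\} = l(y)\,\mathsf{E}\{f(X) \mid Y = y\}. \tag{74} $$

Hence,

$$ \frac{\mathrm{d}}{\mathrm{d}\,\mathrm{snr}}\,l(y) = \frac{1}{2}\,l(y) \left[ \frac{1}{\sqrt{\mathrm{snr}}}\,y^{\top}\mathsf{E}\{X \mid Y = y\} - \mathsf{E}\left\{ \|X\|^2 \mid Y = y \right\} \right] \tag{75} $$
$$ = \frac{1}{2\,\mathrm{snr}} \left[ l(y)\,y^{\top}\nabla \log l(y) - \nabla^2 l(y) \right] \tag{76} $$

where (76) is due to Lemmas 2 and 4. Note that the order of expectation with respect to $P_X$ and the derivative with respect to the SNR can be exchanged as long as the input has finite power by Lebesgue's (Dominated) Convergence Theorem [45], [46] (see also Lemma 8 in Appendix IV).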
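The divergence can be written as

\begin{align}
D(P_Y \,\|\, P_N) &= \int p_Y(y) \log \frac{p_Y(y)}{p_N(y)}\, dy \tag{77} \\
&= \mathsf{E}\{l(N) \log l(N)\}, \tag{78}
\end{align}

and its derivative

$$ \frac{d}{d\mathsf{snr}}\, D(P_Y \,\|\, P_N) = \mathsf{E}\Big\{\log l(N)\, \frac{d}{d\mathsf{snr}}\, l(N)\Big\}. \tag{79} $$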
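The step from (78) to (79) drops one term, and it may help to see why. The following is a short sketch, assuming (as the text does) that the derivative and the expectation over $N$ can be exchanged:

```latex
\begin{align*}
\frac{d}{d\mathsf{snr}}\,\mathsf{E}\{ l(N)\log l(N) \}
  &= \mathsf{E}\Big\{ \big[\,1 + \log l(N)\,\big]\,\frac{d}{d\mathsf{snr}}\, l(N) \Big\}, \\
\mathsf{E}\Big\{ \frac{d}{d\mathsf{snr}}\, l(N) \Big\}
  &= \frac{d}{d\mathsf{snr}}\,\mathsf{E}\{ l(N) \}
   = \frac{d}{d\mathsf{snr}} \int p_Y(y)\,dy
   = 0 .
\end{align*}
```

Since $\mathsf{E}\{l(N)\} = 1$ for every SNR, only the term carrying $\log l(N)$ survives.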



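Again, the order of derivative and expectation can be exchanged by the Lebesgue Convergence Theorem. By (76), the derivative (79) can be evaluated as

\begin{align}
&\frac{1}{2\,\mathsf{snr}}\,\mathsf{E}\{l(N) \log l(N)\, N \cdot \nabla \log l(N)\} - \frac{1}{2\,\mathsf{snr}}\,\mathsf{E}\{\log l(N)\, \nabla^2 l(N)\} \notag \\
&\qquad = \frac{1}{2\,\mathsf{snr}}\,\mathsf{E}\big\{\nabla \cdot [\,l(N) \log l(N)\, \nabla \log l(N)\,] - \log l(N)\, \nabla^2 l(N)\big\} \tag{80} \\
&\qquad = \frac{1}{2\,\mathsf{snr}}\,\mathsf{E}\big\{l(N)\, \|\nabla \log l(N)\|^2\big\} \tag{81} \\
&\qquad = \frac{1}{2\,\mathsf{snr}}\,\mathsf{E}\|\nabla \log l(Y)\|^2 \tag{82} \\
&\qquad = \frac{1}{2}\,\mathsf{E}\|\mathsf{E}\{X \mid Y\}\|^2, \tag{83}
\end{align}

where to write (80) we used the following relationship (which can be checked by integration by parts) satisfied by a standard Gaussian vector $N$:

$$ \mathsf{E}\{N^\top f(N)\} = \mathsf{E}\{\nabla \cdot f(N)\} \tag{84} $$

for every vector-valued function $f: \mathbb{R}^L \to \mathbb{R}^L$ that satisfies $f_i(n)\, e^{-n_i^2/2} \to 0$ as $n_i \to \infty$, $i = 1, \ldots, L$.

F. Asymptotics of Mutual Information and MMSE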

In the following, the asymptotics of the mutual information 
and MMSE at low and high SNRs are studied mainly for the 
scalar Gaussian channel. 

The Lebesgue Convergence Theorem guarantees continuity of the MMSE estimate:

$$ \lim_{\mathsf{snr} \to 0} \mathsf{E}\{X \mid Y;\mathsf{snr}\} = \mathsf{E}X, \tag{85} $$

and hence

$$ \lim_{\mathsf{snr} \to 0} \mathsf{mmse}(\mathsf{snr}) = \mathsf{mmse}(0) = \sigma_X^2 \tag{86} $$

where $\sigma_{(\cdot)}^2$ denotes the variance of a random variable. It has been shown in [3] that symmetric (proper-complex in the complex case) signaling is second-order optimal in terms of mutual information in the low-SNR regime.

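A more refined study of the asymptotics is possible by examining the Taylor series expansion of a family of well-defined functions:

$$ q_i(y;\mathsf{snr}) = \mathsf{E}\{X^i\, p_{Y|X;\mathsf{snr}}(y \mid X;\mathsf{snr})\}, \quad i = 0, 1, \ldots \tag{87} $$

Clearly, $p_{Y;\mathsf{snr}}(y;\mathsf{snr}) = q_0(y;\mathsf{snr})$, and the conditional mean estimate is expressed as

$$ \mathsf{E}\{X \mid Y = y;\mathsf{snr}\} = \frac{q_1(y;\mathsf{snr})}{q_0(y;\mathsf{snr})}. \tag{88} $$

Meanwhile, by definition (5) and noting that $p_{Y|X;\mathsf{snr}}$ is Gaussian, one has

$$ I(\mathsf{snr}) = -\frac{1}{2}\log(2\pi e) - \int q_0(y;\mathsf{snr}) \log q_0(y;\mathsf{snr})\, dy. \tag{89} $$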



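As $\mathsf{snr} \to 0$,

\begin{align}
q_i(y;\mathsf{snr}) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\; \mathsf{E}\Big\{X^i \Big[\, &1 + X y\, \mathsf{snr}^{1/2} + \frac{X^2}{2}(y^2 - 1)\, \mathsf{snr} + \frac{X^3}{6}(y^2 - 3)\, y\, \mathsf{snr}^{3/2} \notag \\
&+ \frac{X^4}{24}(y^4 - 6y^2 + 3)\, \mathsf{snr}^2 + \frac{X^5}{120}(15y - 10y^3 + y^5)\, \mathsf{snr}^{5/2} \notag \\
&+ \frac{X^6}{720}(y^6 - 15y^4 + 45y^2 - 15)\, \mathsf{snr}^3 + O(\mathsf{snr}^{7/2}) \Big]\Big\}. \tag{90}
\end{align}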

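Without loss of generality, it is assumed that the input $X$ has zero mean and unit variance. Using (88)-(90), a finer characterization of the MMSE and mutual information is obtained as

$$ \mathsf{mmse}(\mathsf{snr}) = 1 - \mathsf{snr} + \mathsf{snr}^2 - \frac{1}{6}\Big[(\mathsf{E}X^4)^2 - 6\,\mathsf{E}X^4 - 2\,(\mathsf{E}X^3)^2 + 15\Big]\mathsf{snr}^3 + O(\mathsf{snr}^4) \tag{91} $$

and

$$ I(\mathsf{snr}) = \frac{1}{2}\mathsf{snr} - \frac{1}{4}\mathsf{snr}^2 + \frac{1}{6}\mathsf{snr}^3 - \frac{1}{48}\Big[(\mathsf{E}X^4)^2 - 6\,\mathsf{E}X^4 - 2\,(\mathsf{E}X^3)^2 + 15\Big]\mathsf{snr}^4 + O(\mathsf{snr}^5) \tag{92} $$

respectively. It is interesting to note that higher order moments than the mean and variance have no impact on the mutual information to the third order of the SNR.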
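The expansions (91) and (92) can be compared against a direct numerical evaluation for one concrete input. The sketch below assumes equiprobable binary input $X = \pm 1$ (so $\mathsf{E}X^3 = 0$ and $\mathsf{E}X^4 = 1$), for which $\mathsf{mmse}(\mathsf{snr}) = 1 - \mathsf{E}\tanh(\mathsf{snr} + \sqrt{\mathsf{snr}}\,Z)$ and $I(\mathsf{snr}) = \mathsf{snr} - \mathsf{E}\log\cosh(\mathsf{snr} + \sqrt{\mathsf{snr}}\,Z)$ with $Z \sim \mathcal{N}(0,1)$. The helper names and the quadrature grid are illustrative assumptions.

```python
import math

def gauss_expectation(g, half_width=10.0, n=4000):
    """E{g(Z)} for Z ~ N(0, 1), by the trapezoidal rule on [-half_width, half_width]."""
    h = 2.0 * half_width / n
    total = 0.0
    for k in range(n + 1):
        z = -half_width + k * h
        w = 0.5 if k in (0, n) else 1.0
        total += w * g(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def log_cosh(t):
    """Numerically stable log(cosh(t))."""
    a = abs(t)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def mmse_binary(snr):
    """MMSE of equiprobable +/-1 input observed in unit-variance Gaussian noise."""
    return 1.0 - gauss_expectation(lambda z: math.tanh(snr + math.sqrt(snr) * z))

def info_binary(snr):
    """Input-output mutual information (nats) for the same binary input."""
    return snr - gauss_expectation(lambda z: log_cosh(snr + math.sqrt(snr) * z))

def bracket(m3, m4):
    """The moment-dependent bracket shared by (91) and (92)."""
    return m4 ** 2 - 6.0 * m4 - 2.0 * m3 ** 2 + 15.0

def mmse_series(snr, m3, m4):
    """Right-hand side of (91) without the O(snr^4) remainder."""
    return 1.0 - snr + snr ** 2 - bracket(m3, m4) / 6.0 * snr ** 3

def info_series(snr, m3, m4):
    """Right-hand side of (92) without the O(snr^5) remainder."""
    return snr / 2.0 - snr ** 2 / 4.0 + snr ** 3 / 6.0 - bracket(m3, m4) / 48.0 * snr ** 4

if __name__ == "__main__":
    for snr in (0.01, 0.05, 0.1):
        print(snr, mmse_binary(snr), mmse_series(snr, 0.0, 1.0),
              info_binary(snr), info_series(snr, 0.0, 1.0))
```

The gap between the numerical values and the truncated series should shrink quickly as the SNR decreases, in line with the stated remainder orders.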



The asymptotic properties carry over to the vector channel model (19) for finite-power inputs. The MMSE of a real-valued vector channel is obtained to the second order as:

$$ \mathsf{mmse}(\mathsf{snr}) = \operatorname{tr}\{H \Sigma H^\top\} - \mathsf{snr} \cdot \operatorname{tr}\{H \Sigma H^\top H \Sigma H^\top\} + O(\mathsf{snr}^2) \tag{93} $$

where $\Sigma$ is the covariance matrix of the input vector $X$. The input-output mutual information is straightforward by Theorem 2 (see also [4]). The asymptotics can be refined to any order of the SNR using the Taylor series expansion.

At high SNRs, the mutual information is upper bounded for finite-alphabet inputs such as the binary one (18), whereas it can increase at the rate of $\frac{1}{2}\log \mathsf{snr}$ for Gaussian inputs. By Shannon's entropy power inequality [25], [29], given any symmetric input distribution with a density, there exists an $\alpha \in (0, 1]$ such that the mutual information of the scalar channel is bounded:

$$ \frac{1}{2}\log(1 + \alpha\,\mathsf{snr}) \le I(\mathsf{snr}) \le \frac{1}{2}\log(1 + \mathsf{snr}). \tag{94} $$

The MMSE behavior at high SNR depends on the input distribution. The decay can be as slow as $O(1/\mathsf{snr})$ for Gaussian input, whereas for binary input, the MMSE decays as $e^{-2\,\mathsf{snr}}$. In fact, the MMSE can be made to decay faster than any given exponential for sufficiently skewed binary inputs [31].

III. Continuous-time Gaussian Channels 

The success in the discrete-time Gaussian channel setting in Section II can be extended to more technically challenging continuous-time models. Consider the following continuous-time Gaussian channel:

$$ R_t = \sqrt{\mathsf{snr}}\, X_t + N_t, \tag{95} $$

where $\{X_t\}$ is the input process, and $\{N_t\}$ is a white Gaussian noise with a flat double-sided power spectral density of unit height. Since $\{N_t\}$ is not second-order, it is mathematically more convenient to study an equivalent model obtained by integrating the observations in (95). In a concise form, the input and output processes are related by a standard Wiener process $\{W_t\}$ independent of the input [47], [48]:

$$ dY_t = \sqrt{\mathsf{snr}}\, X_t\, dt + dW_t, \quad t \in [0, T]. \tag{96} $$

Also known as Brownian motion, $\{W_t\}$ is a continuous Gaussian process that satisfies

$$ \mathsf{E}\{W_t W_s\} = \min(t, s), \quad \forall\, t, s. \tag{97} $$



Instead of scaling the Brownian motion (as is customary in 
the literature), we choose to scale the input process so as to 
minimize notation in the analysis and results. 

A. Mutual Information and MMSEs 

We are concerned with three quantities associated with the model (96), namely, the causal MMSE achieved by optimal filtering, the noncausal MMSE achieved by optimal smoothing, and the mutual information between the input and output processes. As a convention, let $X_a^b$ denote the process $\{X_t\}$ in the interval $[a, b]$. Also, let $\mu_X$ denote the probability measure induced by $\{X_t\}$ in the interval of interest, which, for concreteness, we assume to be $[0, T]$. The input-output mutual information is defined by [49], [50]:

$$ I(X_0^T; Y_0^T) = \int \log \Phi \; d\mu_{XY} \tag{98} $$

if the Radon-Nikodym derivative

$$ \Phi = \frac{d\mu_{XY}}{d\mu_X\, d\mu_Y} \tag{99} $$

exists. The causal and noncausal MMSEs at any time $t \in [0, T]$ are defined in the usual way:

$$ \mathsf{cmmse}(t, \mathsf{snr}) = \mathsf{E}\Big\{\big(X_t - \mathsf{E}\{X_t \mid Y_0^t;\mathsf{snr}\}\big)^2\Big\}, \tag{100} $$

and

$$ \mathsf{mmse}(t, \mathsf{snr}) = \mathsf{E}\Big\{\big(X_t - \mathsf{E}\{X_t \mid Y_0^T;\mathsf{snr}\}\big)^2\Big\}. \tag{101} $$

Recall the mutual information rate (mutual information per unit time) defined as:

$$ I(\mathsf{snr}) = \frac{1}{T}\, I(X_0^T; Y_0^T). \tag{102} $$

Similarly, the average causal and noncausal MMSEs (per unit time) are defined as

$$ \mathsf{cmmse}(\mathsf{snr}) = \frac{1}{T} \int_0^T \mathsf{cmmse}(t, \mathsf{snr})\, dt \tag{103} $$

and

$$ \mathsf{mmse}(\mathsf{snr}) = \frac{1}{T} \int_0^T \mathsf{mmse}(t, \mathsf{snr})\, dt \tag{104} $$

respectively.

To start with, let $T \to \infty$ and assume that the input to the continuous-time model (96) is a stationary* Gaussian process with power spectrum $S_X(\omega)$. The mutual information rate was obtained by Shannon [51]:

$$ I(\mathsf{snr}) = \frac{1}{2} \int_{-\infty}^{\infty} \log\big(1 + \mathsf{snr}\, S_X(\omega)\big)\, \frac{d\omega}{2\pi}. \tag{105} $$

With Gaussian input, both optimal filtering and smoothing are linear. The noncausal MMSE is due to Wiener [52],

$$ \mathsf{mmse}(\mathsf{snr}) = \int_{-\infty}^{\infty} \frac{S_X(\omega)}{1 + \mathsf{snr}\, S_X(\omega)}\, \frac{d\omega}{2\pi}, \tag{106} $$

and the causal MMSE is due to Yovits and Jackson [53]:

$$ \mathsf{cmmse}(\mathsf{snr}) = \frac{1}{\mathsf{snr}} \int_{-\infty}^{\infty} \log\big(1 + \mathsf{snr}\, S_X(\omega)\big)\, \frac{d\omega}{2\pi}. \tag{107} $$

From (105) and (106), it is easy to see that the derivative of the mutual information rate is equal to half the noncausal MMSE, i.e., the central formula (1) holds literally in this case. Moreover, (105) and (107) show that the mutual information rate is equal to the causal MMSE scaled by half the SNR, although, interestingly, this connection escaped Yovits and Jackson [53].

*For stationary input it would be more convenient to shift $[0, T]$ to $[-T/2, T/2]$ and then let $T \to \infty$ so that the causal and noncausal MMSEs at any time $t \in (-\infty, \infty)$ are independent of $t$. We stick to $[0, T]$ in this paper for notational simplicity in case of general inputs.

In fact, these relationships are true not only for Gaussian inputs.

Theorem 6: For every input process $\{X_t\}$ to the Gaussian channel (96) with finite average power, i.e.,

$$ \int_0^T \mathsf{E}X_t^2\, dt < \infty, \tag{108} $$

the input-output mutual information rate and the average noncausal MMSE are related by

$$ \frac{d}{d\mathsf{snr}}\, I(\mathsf{snr}) = \frac{1}{2}\, \mathsf{mmse}(\mathsf{snr}). \tag{109} $$

Proof: See Section III-C.
Theorem 7 (Duncan [15]): For any input process with finite average power,

$$ I(\mathsf{snr}) = \frac{\mathsf{snr}}{2}\, \mathsf{cmmse}(\mathsf{snr}). \tag{110} $$

Together, Theorems 6 and 7 show that the mutual information, the causal MMSE and the noncausal MMSE satisfy a triangle relationship. In particular, using the information rate as a bridge, the causal MMSE is found to be equal to the noncausal MMSE averaged over SNR.

Theorem 8: For any input process with finite average power,

$$ \mathsf{cmmse}(\mathsf{snr}) = \frac{1}{\mathsf{snr}} \int_0^{\mathsf{snr}} \mathsf{mmse}(\gamma)\, d\gamma. \tag{111} $$

Equality (111) is a surprising fundamental relationship between causal and noncausal MMSEs. It is quite remarkable considering the fact that nonlinear filtering is usually a hard problem and few analytical expressions are known for the optimal estimation errors.
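For Gaussian inputs the three quantities have the spectral forms (105)-(107), so the triangle relationship can be checked numerically. The sketch below assumes a stationary Gaussian input with the Lorentzian spectrum $S_X(\omega) = 2a/(a^2 + \omega^2)$ (unit power), for which the integrals evaluate to $\mathsf{mmse}(\mathsf{snr}) = a/\sqrt{a^2 + 2a\,\mathsf{snr}}$ and $\mathsf{cmmse}(\mathsf{snr}) = (\sqrt{a^2 + 2a\,\mathsf{snr}} - a)/\mathsf{snr}$. The spectrum, the function names and the quadrature sizes are choices made for this illustration.

```python
import math

def spectral_average(g, a=1.0, n=2000):
    """(1 / (2 pi)) * integral over all omega of g(S_X(omega)) d omega.

    Assumes S_X(omega) = 2a / (a^2 + omega^2). The substitution
    omega = tan(theta) maps the real line to (-pi/2, pi/2), where the
    midpoint rule is applied.
    """
    h = math.pi / n
    total = 0.0
    for k in range(n):
        omega = math.tan(-0.5 * math.pi + (k + 0.5) * h)
        s = 2.0 * a / (a * a + omega * omega)
        total += g(s) * (1.0 + omega * omega)  # d omega = (1 + omega^2) d theta
    return total * h / (2.0 * math.pi)

def mmse(snr):
    """Noncausal MMSE, Eq. (106)."""
    return spectral_average(lambda s: s / (1.0 + snr * s))

def info_rate(snr):
    """Mutual information rate, Eq. (105)."""
    return 0.5 * spectral_average(lambda s: math.log1p(snr * s))

def cmmse(snr):
    """Causal MMSE, Eq. (107), written through the information rate."""
    return 2.0 * info_rate(snr) / snr

def averaged_mmse(snr, m=200):
    """Right-hand side of Eq. (111): noncausal MMSE averaged over [0, snr]."""
    h = snr / m
    return sum(mmse((j + 0.5) * h) for j in range(m)) * h / snr

if __name__ == "__main__":
    snr = 5.0
    print("mmse     ", mmse(snr), 1.0 / math.sqrt(1.0 + 2.0 * snr))
    print("cmmse    ", cmmse(snr), (math.sqrt(1.0 + 2.0 * snr) - 1.0) / snr)
    print("avg mmse ", averaged_mmse(snr))
```

The last two printed values should agree up to quadrature error, which is the content of Theorem 8 in this special case.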

Although in general the optimal anti-causal filter is different 
from the optimal causal filter, an interesting observation that 
follows from Theorem 8 is that for stationary inputs the
average anti-causal MMSE per unit time is equal to the average 
causal one. To see this, note that the average noncausal MMSE 
remains the same in reversed time and that white Gaussian 
noise is reversible. 

It is worth pointing out that Theorems 6-8 are still valid if the time averages in (102)-(104) are replaced by their limits as $T \to \infty$. This is particularly relevant to the case of stationary
inputs. 

Besides Gaussian inputs, another example of the relation in Theorem 8 is an input process called the random telegraph waveform, where $\{X_t\}$ is a stationary Markov process with two equally probable states ($X_t = \pm 1$). See Figure 4 for an illustration. Assume that the transition rate of the input Markov process is $\nu$, i.e., for sufficiently small $h$,

$$ \mathsf{P}\{X_{t+h} = X_t\} = 1 - \nu h + o(h), \tag{112} $$

the expressions for the MMSEs achieved by optimal filtering and smoothing are obtained as [54], [55]:

$$ \mathsf{cmmse}(\mathsf{snr}) = \frac{\displaystyle\int_1^\infty u^{-1/2} (u - 1)^{-1/2}\, e^{-2\nu u/\mathsf{snr}}\, du}{\displaystyle\int_1^\infty u^{1/2} (u - 1)^{-1/2}\, e^{-2\nu u/\mathsf{snr}}\, du} \tag{113} $$

Fig. 4. Sample path of the input and output processes of an additive white Gaussian noise channel, the output of the optimal causal and anti-causal filters, as well as the output of the optimal smoother. The input $\{X_t\}$ is a random telegraph waveform with unit transition rate. The SNR is 15 dB.

Fig. 5. The causal and noncausal MMSEs of continuous-time Gaussian channel with the random telegraph waveform input. The rate $\nu = 1$. The two shaded regions have the same area due to Theorem 8.

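and

$$ \mathsf{mmse}(\mathsf{snr}) = \frac{\displaystyle\int_{-1}^{1}\!\int_{-1}^{1} \frac{[\,\cdots\,]}{-(1-x)^3 (1-y)^3 (1+x)(1+y)}\; dx\, dy}{\displaystyle \int_1^\infty u^{1/2} (u - 1)^{-1/2}\, e^{-2\nu u/\mathsf{snr}}\, du} \tag{114} $$

respectively. The relationship (111) is verified in Appendix V. The MMSEs are plotted in Figure 5 as functions of the SNR for unit transition rate.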
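The causal MMSE of the random telegraph input is easy to evaluate numerically. The sketch below takes the ratio of integrals $\int_1^\infty u^{-1/2}(u-1)^{-1/2} e^{-2\nu u/\mathsf{snr}}\,du \,\big/ \int_1^\infty u^{1/2}(u-1)^{-1/2} e^{-2\nu u/\mathsf{snr}}\,du$ as the expression for $\mathsf{cmmse}(\mathsf{snr})$ in (113), and removes the endpoint singularity at $u = 1$ with the substitution $u = (1 + \cosh\theta)/2$, which is a choice made here and not taken from the paper. The function name and grid size are likewise illustrative.

```python
import math

def telegraph_cmmse(snr, nu=1.0, n=20000):
    """Causal MMSE of the random telegraph waveform, as a ratio of two integrals.

    With u = (1 + cosh(theta)) / 2 the numerator integrand becomes
    exp(-x * (1 + cosh(theta))) and the denominator integrand becomes
    u * exp(-x * (1 + cosh(theta))), where x = nu / snr. The common factor
    exp(-2 x) cancels in the ratio and is dropped to avoid underflow.
    """
    x = nu / snr
    theta_max = math.acosh(1.0 + 40.0 / x)  # integrand is below exp(-40) beyond this
    h = theta_max / n
    num = 0.0
    den = 0.0
    for k in range(n + 1):
        c = math.cosh(k * h)
        w = 0.5 if k in (0, n) else 1.0      # trapezoidal end weights
        e = math.exp(-x * (c - 1.0))
        num += w * e
        den += w * 0.5 * (1.0 + c) * e
    return num / den                          # the step h cancels in the ratio

if __name__ == "__main__":
    for snr_db in (-20, -10, 0, 10, 15, 20):
        snr = 10.0 ** (snr_db / 10.0)
        print(snr_db, "dB:", telegraph_cmmse(snr))
```

At low SNR the value should approach $1 - \mathsf{snr}/(4\nu)$, which is what linear filtering of a process with autocorrelation $e^{-2\nu|\tau|}$ predicts to first order.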

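Figure 4 shows experimental results of the filtering and smoothing of the random telegraph signal corrupted by additive white Gaussian noise. The optimal causal filter follows Wonham [54]:

$$ d\hat{X}_t = -\Big[2\nu \hat{X}_t + \mathsf{snr}\, \hat{X}_t \big(1 - \hat{X}_t^2\big)\Big]\, dt + \sqrt{\mathsf{snr}}\, \big(1 - \hat{X}_t^2\big)\, dY_t, \tag{115} $$

where

$$ \hat{X}_t = \mathsf{E}\{X_t \mid Y_0^t\}. \tag{116} $$

The anti-causal filter is merely a time reversal of the filter of the same type. The smoother is due to Yao [55]:

$$ \mathsf{E}\{X_t \mid Y_0^T\} = \frac{\mathsf{E}\{X_t \mid Y_0^t\} + \mathsf{E}\{X_t \mid Y_t^T\}}{1 + \mathsf{E}\{X_t \mid Y_0^t\}\, \mathsf{E}\{X_t \mid Y_t^T\}}. \tag{117} $$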
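A small simulation gives a feel for the causal filter. The sketch below does not integrate the stochastic differential equation of the Wonham filter directly. Instead it assumes a time-discretized hidden-Markov approximation: in each step of length `dt` the state flips with probability `nu * dt`, the observed increment is `sqrt(snr) * x * dt` plus Gaussian noise of variance `dt`, and the posterior is updated by Bayes' rule. All names, the step size and the seed are illustrative assumptions.

```python
import math
import random

def simulate_filter_mse(snr, nu=1.0, dt=1e-3, steps=200_000, seed=0):
    """Empirical mean-square error of a discretized causal filter.

    p tracks P{X_t = +1 | observations so far}; the estimate is 2p - 1.
    """
    rng = random.Random(seed)
    x = 1 if rng.random() < 0.5 else -1
    p = 0.5
    root_snr = math.sqrt(snr)
    root_dt = math.sqrt(dt)
    squared_error = 0.0
    for _ in range(steps):
        # Telegraph state: flip with probability nu * dt.
        if rng.random() < nu * dt:
            x = -x
        # Prediction step of the two-state Markov chain.
        p = p * (1.0 - nu * dt) + (1.0 - p) * nu * dt
        # Observation increment dY = sqrt(snr) * x * dt + dW.
        dy = root_snr * x * dt + rng.gauss(0.0, root_dt)
        # Bayes update: the log-likelihood ratio of +1 versus -1 is 2 sqrt(snr) dY.
        llr = math.log(p / (1.0 - p)) + 2.0 * root_snr * dy
        p = 1.0 / (1.0 + math.exp(-llr))
        x_hat = 2.0 * p - 1.0
        squared_error += (x - x_hat) ** 2
    return squared_error / steps

if __name__ == "__main__":
    snr = 10.0 ** 1.5  # 15 dB, unit transition rate
    print("empirical causal MSE:", simulate_filter_mse(snr))
```

With a long enough run and a small enough step, the empirical error should land in the neighborhood of the causal MMSE of the continuous-time filter, up to discretization and sampling error.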



B. Low- and High-SNR Asymptotics 

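Based on Theorem 8, one can study the asymptotics of the mutual information and MMSE under low SNRs. The causal and noncausal MMSE relationship implies that

$$ \lim_{\mathsf{snr} \to 0} \frac{\mathsf{mmse}(0) - \mathsf{mmse}(\mathsf{snr})}{\mathsf{cmmse}(0) - \mathsf{cmmse}(\mathsf{snr})} = 2 \tag{118} $$

where

$$ \mathsf{cmmse}(0) = \mathsf{mmse}(0) = \frac{1}{T} \int_0^T \mathsf{E}X_t^2\, dt. \tag{119} $$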

Hence the initial rate of decrease (with snr) of the noncausal 
MMSE is twice that of the causal MMSE. 
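The factor of 2 follows from a one-line calculation. As a sketch, assume that the noncausal MMSE admits a first-order expansion $\mathsf{mmse}(\gamma) = \mathsf{mmse}(0) - \kappa\gamma + o(\gamma)$ near zero, where $\kappa$ is a constant introduced here only for illustration. Averaging over SNR as in Theorem 8 gives:

```latex
\begin{align*}
\mathsf{cmmse}(\mathsf{snr})
  &= \frac{1}{\mathsf{snr}} \int_0^{\mathsf{snr}}
     \big[\mathsf{mmse}(0) - \kappa\gamma + o(\gamma)\big]\, d\gamma
   = \mathsf{mmse}(0) - \frac{\kappa}{2}\,\mathsf{snr} + o(\mathsf{snr}), \\
\frac{\mathsf{mmse}(0) - \mathsf{mmse}(\mathsf{snr})}
     {\mathsf{cmmse}(0) - \mathsf{cmmse}(\mathsf{snr})}
  &= \frac{\kappa\,\mathsf{snr} + o(\mathsf{snr})}
          {\tfrac{\kappa}{2}\,\mathsf{snr} + o(\mathsf{snr})}
   \;\longrightarrow\; 2 \qquad (\mathsf{snr} \to 0).
\end{align*}
```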

In the high-SNR regime, there exist inputs that make the MMSE exponentially small. However, in case of Gauss-Markov input processes, Steinberg et al. [56] observed that the causal MMSE is asymptotically twice the noncausal MMSE, as long as the input-output relationship is described by

$$ dY_t = \sqrt{\mathsf{snr}}\, h(X_t)\, dt + dW_t \tag{120} $$

where $h(\cdot)$ is a differentiable and increasing function. In the special case where $h(X_t) = X_t$, Steinberg et al.'s observation can be justified by noting that in the Gauss-Markov case, the smoothing MMSE satisfies [57]:

$$ \mathsf{mmse}(\mathsf{snr}) = \frac{c}{\sqrt{\mathsf{snr}}} + o\Big(\frac{1}{\mathsf{snr}}\Big), \tag{121} $$

which implies according to (111) that

$$ \lim_{\mathsf{snr} \to \infty} \frac{\mathsf{cmmse}(\mathsf{snr})}{\mathsf{mmse}(\mathsf{snr})} = 2. \tag{122} $$

Unlike the universal factor of 2 result in (118) for the low SNR regime, the 3 dB loss incurred by the causality constraint fails to hold in general in the high SNR asymptote. For example, for the random telegraph waveform input, the causality penalty increases in the order of $\log \mathsf{snr}$ [55].

C. The SNR-Incremental Channel 

Theorem 6 can be proved using the SNR-incremental channel approach developed in Section II. Consider a cascade of two Gaussian channels with independent noise processes:

\begin{align}
dY_{1t} &= X_t\, dt + \sigma_1\, dW_{1t}, \tag{123a} \\
dY_{2t} &= dY_{1t} + \sigma_2\, dW_{2t}, \tag{123b}
\end{align}

where $\{W_{1t}\}$ and $\{W_{2t}\}$ are independent standard Wiener processes also independent of $\{X_t\}$, and $\sigma_1$ and $\sigma_2$ satisfy (31) so that the signal-to-noise ratios of the first channel and the composite channel are $\mathsf{snr} + \delta$ and $\mathsf{snr}$ respectively. Following steps similar to those that lead to (37), it can be shown that

$$ (\mathsf{snr} + \delta)\, dY_{1t} = \mathsf{snr}\, dY_{2t} + \delta\, X_t\, dt + \sqrt{\delta}\, dW_t, \tag{124} $$

where $\{W_t\}$ is a standard Wiener process independent of $\{X_t\}$ and $\{Y_{2t}\}$. Hence conditioned on the process $\{Y_{2t}\}$ in $[0, T]$, (124) can be regarded as a Gaussian channel with an SNR of $\delta$. Similar to Lemma 1, the following result holds.

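Lemma 5: As $\delta \to 0$, the input-output mutual information of the following Gaussian channel:

$$ dY_t = \sqrt{\delta}\, Z_t\, dt + dW_t, \quad t \in [0, T], \tag{125} $$

where $\{W_t\}$ is a standard Wiener process independent of the input $\{Z_t\}$, which satisfies

$$ \int_0^T \mathsf{E}Z_t^2\, dt < \infty, \tag{126} $$

is given by the following:

$$ \lim_{\delta \to 0} \frac{1}{\delta}\, I(Z_0^T; Y_0^T) = \frac{1}{2} \int_0^T \mathsf{E}(Z_t - \mathsf{E}Z_t)^2\, dt. \tag{127} $$

Proof: See Appendix VI.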
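Applying Lemma 5 to the Gaussian channel (124) conditioned on $\{Y_{2t}\}$ in $[0, T]$, one has

$$ I\big(X_0^T; Y_{1,0}^T \,\big|\, Y_{2,0}^T\big) = \frac{\delta}{2} \int_0^T \mathsf{E}\Big\{\big(X_t - \mathsf{E}\{X_t \mid Y_{2,0}^T\}\big)^2\Big\}\, dt + o(\delta). \tag{128} $$

Since $\{X_t\} - \{Y_{1t}\} - \{Y_{2t}\}$ is a Markov chain, the left hand side of (128) is recognized as the mutual information increase:

\begin{align}
I\big(X_0^T; Y_{1,0}^T \,\big|\, Y_{2,0}^T\big) &= I\big(X_0^T; Y_{1,0}^T\big) - I\big(X_0^T; Y_{2,0}^T\big) \tag{129} \\
&= T\,\big[I(\mathsf{snr} + \delta) - I(\mathsf{snr})\big]. \tag{130}
\end{align}

By (130) and definition of the noncausal MMSE (101), (128) can be rewritten as

$$ I(\mathsf{snr} + \delta) - I(\mathsf{snr}) = \frac{\delta}{2T} \int_0^T \mathsf{mmse}(t, \mathsf{snr})\, dt + o(\delta). \tag{131} $$

Hence the proof of Theorem 6.

The property that independent Wiener processes sum up to a Wiener process is essential in the above proof. The incremental channel device is very useful in proving integral equations such as in Theorem 6.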

D. The Time-Incremental Channel 

Note that Duncan's Theorem (Theorem 7), which links the mutual information and the causal MMSE, is also an integral equation, although implicit, where the integral is with respect to time on the right hand side of (110). Analogous to the SNR-incremental channel, one can investigate the mutual information increase due to an infinitesimal additional observation time of the channel output using a "time-incremental channel".
This approach leads to a more intuitive proof of Duncan's 
Theorem than the original one in [15], which relies on intricate 
properties of likelihood ratios and stochastic calculus. 



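Duncan's Theorem is equivalent to

$$ I\big(X_0^{t+\delta}; Y_0^{t+\delta}\big) - I\big(X_0^t; Y_0^t\big) = \delta\, \frac{\mathsf{snr}}{2}\, \mathsf{E}\Big\{\big(X_t - \mathsf{E}\{X_t \mid Y_0^t\}\big)^2\Big\} + o(\delta), \tag{132} $$

which is to say the mutual information increase due to the extra observation time is proportional to the causal MMSE. The left hand side of (132) can be written as

\begin{align}
& I\big(X_0^{t+\delta}; Y_0^{t+\delta}\big) - I\big(X_0^t; Y_0^t\big) \notag \\
&\quad = I\big(X_0^t, X_t^{t+\delta}; Y_0^t, Y_t^{t+\delta}\big) - I\big(X_0^t; Y_0^t\big) \tag{133} \\
&\quad = I\big(X_t^{t+\delta}; Y_t^{t+\delta} \,\big|\, Y_0^t\big) + I\big(X_0^t; Y_t^{t+\delta} \,\big|\, X_t^{t+\delta}, Y_0^t\big) + I\big(X_0^t, X_t^{t+\delta}; Y_0^t\big) - I\big(X_0^t; Y_0^t\big) \tag{134} \\
&\quad = I\big(X_t^{t+\delta}; Y_t^{t+\delta} \,\big|\, Y_0^t\big) + I\big(X_0^t; Y_t^{t+\delta} \,\big|\, X_t^{t+\delta}, Y_0^t\big) + I\big(X_t^{t+\delta}; Y_0^t \,\big|\, X_0^t\big). \tag{135}
\end{align}

Since $Y_0^t - X_0^t - X_t^{t+\delta} - Y_t^{t+\delta}$ is a Markov chain, the last two mutual informations in (135) vanish due to conditional independence. Therefore,

$$ I\big(X_0^{t+\delta}; Y_0^{t+\delta}\big) - I\big(X_0^t; Y_0^t\big) = I\big(X_t^{t+\delta}; Y_t^{t+\delta} \,\big|\, Y_0^t\big), \tag{136} $$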



i.e., the increase in the mutual information is the conditional mutual information between the input and output during the extra time interval given the past observation. Note that conditioned on $Y_0^t$, the probability law of the channel in $(t, t + \delta)$ remains the same but with different input statistics due to conditioning on $Y_0^t$. Let us denote this new channel by

$$ d\tilde{Y}_t = \sqrt{\mathsf{snr}}\, \tilde{X}_t\, dt + dW_t, \quad t \in [0, \delta], \tag{137} $$

where the time duration is shifted to $[0, \delta]$, and the input process $\tilde{X}_0^\delta$ has the same law as $X_t^{t+\delta}$ conditioned on $Y_0^t$. Instead of looking at this new problem of an infinitesimal time interval $[0, \delta]$, we can convert the problem to a familiar one by an expansion in the time axis. Since $\sqrt{\delta}\, W_{t/\delta}$ is also a standard Wiener process, the channel (137) in $[0, \delta]$ is equivalent to a new channel described by

$$ d\bar{Y}_\tau = \sqrt{\delta\, \mathsf{snr}}\; \bar{X}_\tau\, d\tau + dW'_\tau, \quad \tau \in [0, 1], \tag{138} $$



where $\bar{X}_\tau = \tilde{X}_{\tau\delta}$, and $\{W'_\tau\}$ is a standard Wiener process. The channel (138) is of (fixed) unit duration but a diminishing signal-to-noise ratio of $\delta\, \mathsf{snr}$. It is interesting to note that the trick here performs a "time-SNR" transform. By Lemma 5, the mutual information is

\begin{align}
I\big(X_t^{t+\delta}; Y_t^{t+\delta} \,\big|\, Y_0^t\big) &= I\big(\bar{X}_0^1; \bar{Y}_0^1\big) \tag{139} \\
&= \frac{\delta\, \mathsf{snr}}{2} \int_0^1 \mathsf{E}\big(\bar{X}_\tau - \mathsf{E}\bar{X}_\tau\big)^2\, d\tau + o(\delta) \tag{140} \\
&= \frac{\delta\, \mathsf{snr}}{2} \int_0^1 \mathsf{E}\Big\{\big(X_{t+\tau\delta} - \mathsf{E}\{X_{t+\tau\delta} \mid Y_0^t;\mathsf{snr}\}\big)^2\Big\}\, d\tau + o(\delta) \tag{141} \\
&= \frac{\delta\, \mathsf{snr}}{2}\, \mathsf{E}\Big\{\big(X_t - \mathsf{E}\{X_t \mid Y_0^t;\mathsf{snr}\}\big)^2\Big\} + o(\delta), \tag{142}
\end{align}

where (142) is justified by the continuity of the MMSE. The relation (132) is then established by (136) and (142), and hence the proof of Duncan's Theorem.

Similar to the discussion in Section II-C, the integral equations in Theorems 6 and 7 proved by using the SNR- and time-incremental channels are also consequences of the mutual information chain rule applied to a Markov chain of the channel input and degraded versions of channel outputs. The independent-increment properties of Gaussian processes both SNR-wise and time-wise are quintessential in establishing the results.

E. A Fifth Proof of Theorem 1

A fifth proof of the mutual information and MMSE relation in the random variable/vector model can be obtained using continuous-time results. For simplicity, Theorem 1 is proved using Theorem 7. The proof can be easily modified to show Theorem 2 using the vector version of Duncan's Theorem [15].

A continuous-time counterpart of the model (3) can be constructed by letting $X_t \equiv X$ for $t \in [0, 1]$, where $X$ is a random variable independent of $t$:

$$ dY_t = \sqrt{\mathsf{snr}}\, X\, dt + dW_t. \tag{143} $$

For every $u \in [0, 1]$, $Y_u$ is a sufficient statistic of the observation $Y_0^u$ for $X$ (and $X_0^u$). This is because the process $\{Y_t - (t/u) Y_u\}$, $t \in [0, u]$, is independent of $X$ (and $X_0^u$). Therefore, the input-output mutual information of the scalar channel (3) is equal to the mutual information of the continuous-time channel (143):

$$ I(\mathsf{snr}) = I(X; Y_1) = I\big(X_0^1; Y_0^1\big). \tag{144} $$

Integrating both sides of (143), one has

$$ Y_u = \sqrt{\mathsf{snr}}\, u\, X + W_u, \quad u \in [0, 1], \tag{145} $$

where $W_u \sim \mathcal{N}(0, u)$. Note that (145) is a scalar Gaussian channel with a time-varying SNR which grows linearly from 0 to $\mathsf{snr}$. Due to the sufficiency of $Y_u$, the MMSE of the continuous-time model given the observation $Y_0^u$, i.e., the causal MMSE at time $u$, is equal to the MMSE of a scalar Gaussian channel with an SNR of $u\, \mathsf{snr}$:

$$ \mathsf{cmmse}(u, \mathsf{snr}) = \mathsf{mmse}(u\, \mathsf{snr}). \tag{146} $$

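By Duncan's Theorem, the mutual information can be written as

\begin{align}
I\big(X_0^1; Y_0^1\big) &= \frac{\mathsf{snr}}{2} \int_0^1 \mathsf{cmmse}(u, \mathsf{snr})\, du \tag{147} \\
&= \frac{\mathsf{snr}}{2} \int_0^1 \mathsf{mmse}(u\, \mathsf{snr})\, du \tag{148} \\
&= \frac{1}{2} \int_0^{\mathsf{snr}} \mathsf{mmse}(\gamma)\, d\gamma. \tag{149}
\end{align}

Thus Theorem 1 follows by also noticing (144).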
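As a worked instance of this time-SNR change of variable, assume a standard Gaussian input $X \sim \mathcal{N}(0, 1)$, for which the scalar MMSE is $\mathsf{mmse}(\gamma) = 1/(1+\gamma)$. This choice of input is made here for illustration only. The causal MMSE at time $u$ is then $1/(1 + u\,\mathsf{snr})$, and the three integrals evaluate to the familiar Gaussian-channel mutual information:

```latex
\begin{align*}
\frac{\mathsf{snr}}{2} \int_0^1 \frac{du}{1 + u\,\mathsf{snr}}
  \;=\; \frac{1}{2} \int_0^{\mathsf{snr}} \frac{d\gamma}{1 + \gamma}
  \;=\; \frac{1}{2} \log(1 + \mathsf{snr}),
\qquad \gamma = u\,\mathsf{snr} .
\end{align*}
```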

Note that for constant input applied to a continuous-time Gaussian channel, the noncausal MMSE at any time $t$ (101) is equal to the MMSE of a scalar channel with the same SNR:

$$ \mathsf{mmse}(t, u\, \mathsf{snr}) = \mathsf{mmse}(u\, \mathsf{snr}), \quad \forall\, t \in [0, T]. \tag{150} $$

Together, (146) and (150) yield (111) for constant input by averaging over time $u$. Indeed, during any observation time interval of the continuous-time channel output, the SNR of the desired signal against noise is accumulated over time. The integral over time and the integral over SNR are interchangeable in this case. This is another example of the "time-SNR" transform which appeared in Section III-D.

Regarding the above proof, note that the constant input can be replaced by a general form of $X h(t)$, where $h(t)$ is a deterministic signal.

IV. Discrete-time Gaussian Channels 

A. Mutual Information and MMSE 

Consider a real-valued discrete-time Gaussian-noise channel of the form

$$ Y_i = \sqrt{\mathsf{snr}}\, X_i + N_i, \quad i = 1, 2, \ldots, \tag{151} $$

where the noise $\{N_i\}$ is a sequence of independent standard Gaussian random variables, independent of the input process $\{X_i\}$. Let the input statistics be fixed and not dependent on $\mathsf{snr}$.

The finite-horizon version of (151) corresponds to the vector channel (19) with $H$ being the identity matrix. Let $X^n = [X_1, \ldots, X_n]^\top$, $Y^n = [Y_1, \ldots, Y_n]^\top$, and $N^n = [N_1, \ldots, N_n]^\top$. The relation (22) between the mutual information and the MMSE holds due to Theorem 2.

Corollary 3: If $\sum_{i=1}^n \mathsf{E}X_i^2 < \infty$, then

$$ \frac{d}{d\mathsf{snr}}\, I\big(X^n; \sqrt{\mathsf{snr}}\, X^n + N^n\big) = \frac{1}{2} \sum_{i=1}^n \mathsf{mmse}(i, \mathsf{snr}), \tag{152} $$

where

$$ \mathsf{mmse}(i, \mathsf{snr}) = \mathsf{E}\Big\{\big(X_i - \mathsf{E}\{X_i \mid Y^n;\mathsf{snr}\}\big)^2\Big\} \tag{153} $$

is the noncausal MMSE at time $i$ given the entire observation $Y^n$.

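It is also interesting to consider optimal filtering and prediction in this setting. Denote the filtering MMSE as

$$ \mathsf{cmmse}(i, \mathsf{snr}) = \mathsf{E}\Big\{\big(X_i - \mathsf{E}\{X_i \mid Y^i;\mathsf{snr}\}\big)^2\Big\}, \tag{154} $$

and the one-step prediction MMSE as

$$ \mathsf{pmmse}(i, \mathsf{snr}) = \mathsf{E}\Big\{\big(X_i - \mathsf{E}\{X_i \mid Y^{i-1};\mathsf{snr}\}\big)^2\Big\}. \tag{155} $$

Theorem 9: The input-output mutual information satisfies:

\begin{align}
\frac{\mathsf{snr}}{2} \sum_{i=1}^n \mathsf{cmmse}(i, \mathsf{snr}) &\le I(X^n; Y^n) \tag{156a} \\
&\le \frac{\mathsf{snr}}{2} \sum_{i=1}^n \mathsf{pmmse}(i, \mathsf{snr}). \tag{156b}
\end{align}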

Proof: We study the increase in the mutual information due to an extra sample of observation by considering a conceptual time-incremental channel. Since $Y^i - X^i - X_{i+1} - Y_{i+1}$ is a Markov chain, the mutual information increase is equal to

$$ I\big(X^{i+1}; Y^{i+1}\big) - I\big(X^i; Y^i\big) = I\big(X_{i+1}; Y_{i+1} \,\big|\, Y^i\big) \tag{157} $$

using an argument similar to the one that leads to (136). This conditional mutual information can be regarded as the input-output mutual information of the simple scalar channel (3) where the input distribution is replaced by the conditional distribution $P_{X_{i+1}|Y^i}$. By Corollary 2,

\begin{align}
\frac{\mathsf{snr}}{2}\, \mathsf{E}\big\{\mathsf{var}\{X_{i+1} \mid Y^{i+1};\mathsf{snr}\}\big\} &\le I\big(X_{i+1}; Y_{i+1} \,\big|\, Y^i\big) \tag{158} \\
&\le \frac{\mathsf{snr}}{2}\, \mathsf{E}\big\{\mathsf{var}\{X_{i+1} \mid Y^i;\mathsf{snr}\}\big\}, \tag{159}
\end{align}

or equivalently,

$$ \frac{\mathsf{snr}}{2}\, \mathsf{cmmse}(i+1, \mathsf{snr}) \le I\big(X_{i+1}; Y_{i+1} \,\big|\, Y^i\big) \le \frac{\mathsf{snr}}{2}\, \mathsf{pmmse}(i+1, \mathsf{snr}). \tag{160} $$

Finally, we obtain the desired bounds in Theorem 9 by summing (160) over $i = 0, \ldots, n-1$ and using (157). ■

Corollary 3 and Theorem 9 are still valid if all sides are normalized by $n$ and we then take the limit of $n \to \infty$. As a
result, the derivative of the mutual information rate (average 
mutual information per sample) is equal to half the average 
noncausal MMSE per symbol. Also, the mutual information 
rate is sandwiched between half the SNR times the average 
causal and prediction MMSEs per symbol. 
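The sandwich in Theorem 9 can be illustrated on a case where every quantity has a closed-form recursion. The sketch below assumes a stationary Gauss-Markov input $X_{i+1} = \rho X_i + \sqrt{1 - \rho^2}\, U_i$ with unit variance, which is an example chosen here and not taken from the paper. For this input the Kalman recursion gives the prediction and filtering error variances exactly, and the chain rule gives $I(X^n; Y^n) = \sum_i \frac{1}{2}\log(1 + \mathsf{snr}\cdot\mathsf{pmmse}(i, \mathsf{snr}))$.

```python
import math

def kalman_error_variances(n, snr, rho):
    """Prediction and filtering error variances for a unit-variance AR(1) input.

    Model assumed for this sketch:
        X_{i+1} = rho * X_i + sqrt(1 - rho^2) * U_i,   Y_i = sqrt(snr) * X_i + N_i.
    Returns (pmmse list, cmmse list) for i = 1..n.
    """
    predicted, filtered = [], []
    p = 1.0  # variance of X_1 before any observation
    for _ in range(n):
        predicted.append(p)
        f = p / (1.0 + snr * p)                   # error variance after observing Y_i
        filtered.append(f)
        p = rho * rho * f + (1.0 - rho * rho)     # one-step prediction for the next sample
    return predicted, filtered

def mutual_information(predicted, snr):
    """I(X^n; Y^n) in nats, via the chain rule over Gaussian innovations."""
    return sum(0.5 * math.log1p(snr * p) for p in predicted)

def theorem9_terms(n, snr, rho):
    """Lower bound, mutual information and upper bound of Theorem 9."""
    predicted, filtered = kalman_error_variances(n, snr, rho)
    lower = 0.5 * snr * sum(filtered)
    upper = 0.5 * snr * sum(predicted)
    return lower, mutual_information(predicted, snr), upper

if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, theorem9_terms(50, 2.0, rho))
```

Each printed triple should be nondecreasing from left to right, and the gap between the bounds narrows as the SNR decreases.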

B. Discrete-time vs. Continuous-time

In previous sections, the mutual information and the estimation errors have been shown to satisfy similar relations in both discrete- and continuous-time random process models. Indeed, discrete-time processes and continuous-time processes are related fundamentally. For example, a discrete-time process can be regarded as the result of integrate-and-dump sampling of the continuous-time one.

It is straightforward to recover the discrete-time results using the continuous-time ones by considering an equivalent of the discrete-time model (151) as a continuous-time one with piecewise constant input:

$$ dY_t = \sqrt{\mathsf{snr}}\, X_{\lceil t \rceil}\, dt + dW_t, \quad t \in [0, \infty). \tag{161} $$

During the time interval $(i - 1, i]$ the input to the continuous-time model is equal to the random variable $X_i$. The samples of $\{Y_t\}$ at natural numbers are sufficient statistics for the input process $\{X_i\}$. Thus, Corollary 3 follows directly from Theorem 6. Analogously, Duncan's Theorem can be used to prove Theorem 9 [31].

Conversely, for sufficiently smooth input processes, the continuous-time results (Theorem 6 and Duncan's Theorem) can be derived from the discrete-time ones (Corollary 3 and Theorem 9). This can be accomplished by sampling the continuous-time channel outputs and taking the limit of all sides of (156) with vanishing sampling interval. However, in their full generality, the continuous-time results are not a simple extension of the discrete-time ones. A complete analysis of the continuous-time model involves stochastic calculus as developed in Section III.

V. Generalizations 

A. General Additive-noise Channel 

Consider a general setting where the input is preprocessed 
arbitrarily before contamination by additive Gaussian noise. 



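[Figure: block diagram in which the input $X$ passes through the preprocessor $P_{Z|X}$, and Gaussian noise $N$ is added to produce the output $Y$.]

Fig. 6. General additive-noise channel.

The scalar channel setting as depicted in Figure 6 is first considered for simplicity.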

Let $X$ be a random object jointly distributed with a real-valued random variable $Z$. The channel output is expressed as

$$ Y = \sqrt{\mathsf{snr}}\, Z + N, \tag{162} $$

where the noise $N \sim \mathcal{N}(0, 1)$ is independent of $X$ and $Z$. The preprocessor can be regarded as a channel with arbitrary conditional probability distribution $P_{Z|X}$. Since $X - Z - Y$ is a Markov chain,

$$ I(X; Y) = I(Z; Y) - I(Z; Y \mid X). \tag{163} $$

Note that given $(X, Z)$, the channel output $Y$ is Gaussian. Two applications of Theorem 1 to the right hand side of (163) give the following:

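Theorem 10: Let $X - Z - Y$ be a Markov chain and $Y = \sqrt{\mathsf{snr}}\, Z + N$. If $\mathsf{E}Z^2 < \infty$, then

$$ \frac{d}{d\mathsf{snr}}\, I(X; Y) = \frac{1}{2}\, \mathsf{E}\Big\{\big(Z - \mathsf{E}\{Z \mid Y;\mathsf{snr}\}\big)^2\Big\} - \frac{1}{2}\, \mathsf{E}\Big\{\big(Z - \mathsf{E}\{Z \mid Y, X;\mathsf{snr}\}\big)^2\Big\}. \tag{164} $$

The special case of this result for vanishing SNR is given by Theorem 1 of [4]. As a simple illustration of Theorem 10, consider a scalar channel where $X \sim \mathcal{N}(0, \sigma_X^2)$ and $P_{Z|X}$ is a Gaussian channel with noise variance $\sigma^2$. Then straightforward calculations yield

$$ I(X; Y) = \frac{1}{2} \log\Big(1 + \frac{\mathsf{snr}\, \sigma_X^2}{1 + \mathsf{snr}\, \sigma^2}\Big), \tag{165} $$

the derivative of which is equal to half the difference of the two MMSEs:

$$ \frac{1}{2} \Big[\frac{\sigma_X^2 + \sigma^2}{1 + \mathsf{snr}\,(\sigma_X^2 + \sigma^2)} - \frac{\sigma^2}{1 + \mathsf{snr}\, \sigma^2}\Big]. \tag{166} $$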
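The Gaussian illustration of Theorem 10 can be checked with a finite difference. The sketch below assumes the setup described above, namely $X \sim \mathcal{N}(0, \sigma_X^2)$, $Z = X + \sigma V$ with $V \sim \mathcal{N}(0, 1)$, and $Y = \sqrt{\mathsf{snr}}\, Z + N$. The function names and parameter values are illustrative.

```python
import math

def mutual_information(snr, var_x, var_v):
    """I(X; Y) in nats for the Gaussian preprocessor example, as in (165)."""
    return 0.5 * math.log1p(snr * var_x / (1.0 + snr * var_v))

def mmse_given_y(snr, var_x, var_v):
    """MMSE of estimating Z from Y alone; Z is Gaussian with variance var_x + var_v."""
    var_z = var_x + var_v
    return var_z / (1.0 + snr * var_z)

def mmse_given_y_and_x(snr, var_x, var_v):
    """MMSE of estimating Z from (Y, X); only the preprocessor noise remains unknown."""
    return var_v / (1.0 + snr * var_v)

def derivative(f, snr, h=1e-5):
    """Central finite-difference derivative with respect to the SNR."""
    return (f(snr + h) - f(snr - h)) / (2.0 * h)

if __name__ == "__main__":
    var_x, var_v = 2.0, 0.5
    for snr in (0.1, 1.0, 10.0):
        lhs = derivative(lambda s: mutual_information(s, var_x, var_v), snr)
        rhs = 0.5 * (mmse_given_y(snr, var_x, var_v)
                     - mmse_given_y_and_x(snr, var_x, var_v))
        print(snr, lhs, rhs)
```

The two printed columns should coincide up to finite-difference error, matching the statement that the derivative of (165) equals (166).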



In the special case where the preprocessor is a deterministic function of the input, e.g., $Z = g(X)$ where $g(\cdot)$ is an arbitrary deterministic mapping, the second term on the right hand side of (164) vanishes. If, furthermore, $g(\cdot)$ is a one-to-one transformation, then $I(X; Y) = I(g(X); Y)$, and

$$ \frac{d}{d\mathsf{snr}}\, I\big(X; \sqrt{\mathsf{snr}}\, g(X) + N\big) = \frac{1}{2}\, \mathsf{E}\Big\{\big(g(X) - \mathsf{E}\{g(X) \mid Y;\mathsf{snr}\}\big)^2\Big\}. \tag{167} $$

Hence (15) holds verbatim where the MMSE in this case is defined as the minimum error in estimating $g(X)$. Indeed, the vector channel in Theorem 2 is merely a special case of the vector version of this general result.






One of the many scenarios in which the general result can be useful is the intersymbol interference channel. The input $Z_i$ to the Gaussian channel is the desired symbol $X_i$ corrupted by a function of the previous symbols $(X_{i-1}, X_{i-2}, \ldots)$. Theorem 10 can possibly be used to calculate (or bound) the mutual information given a certain input distribution. Another domain of applications of Theorem 10 is the case of fading channels known or unknown at the receiver, e.g., the channel input $Z = AX$ where $A$ is the multiplicative fading coefficient.

Using similar arguments as in the above, nothing prevents us from generalizing Theorem 6 to a much broader family of models:

$$ dY_t = \sqrt{\mathsf{snr}}\, Z_t\, dt + dW_t, \tag{168} $$

where $\{Z_t\}$ is a random process jointly distributed with $X$, and $\{W_t\}$ is a Wiener process independent of $X$ and $\{Z_t\}$.

Theorem 11: As long as the input $\{Z_t\}$ to the channel (168) has finite average power,

$$ \frac{d}{d\mathsf{snr}}\, I\big(X; Y_0^T\big) = \frac{1}{2} \int_0^T \mathsf{E}\Big\{\big(Z_t - \mathsf{E}\{Z_t \mid Y_0^T;\mathsf{snr}\}\big)^2\Big\} - \mathsf{E}\Big\{\big(Z_t - \mathsf{E}\{Z_t \mid Y_0^T, X;\mathsf{snr}\}\big)^2\Big\}\, dt. \tag{169} $$

In case $Z_t = g_t(X)$, where $g_t(\cdot)$ is an arbitrary deterministic one-to-one time-varying mapping, Theorems 6-8 hold verbatim except that the finite-power requirement now applies to $g_t(X)$, and the MMSEs in this case refer to the minimum errors in estimating $g_t(X)$.

B. Gaussian Channels With Feedback 

Duncan's Theorem (Theorem 7) can be generalized to the continuous-time additive white Gaussian noise channel with feedback [17]:

$$ dY_t = \sqrt{\mathsf{snr}}\, Z(t, Y_0^t, X)\, dt + dW_t, \quad t \in [0, T], \tag{170} $$

where $X$ is any random message (including a random process indexed by $t$) and the channel input $\{Z_t\}$ is dependent on the message and past output only. The input-output mutual information of this channel with feedback can be expressed as the time-average of the optimal filtering mean-square error:

Theorem 12 (Kadota, Zakai and Ziv [17]): If the power of the input $\{Z_t\}$ to the channel (170) is finite, then

$$ \frac{1}{T}\, I\big(X; Y_0^T\big) = \frac{\mathsf{snr}}{2}\, \frac{1}{T} \int_0^T \mathsf{E}\Big\{\big(Z(t, Y_0^t, X) - \mathsf{E}\{Z(t, Y_0^t, X) \mid Y_0^t;\mathsf{snr}\}\big)^2\Big\}\, dt. \tag{171} $$

Theorem 12 is proved by showing that Duncan's proof of Theorem 7 remains essentially intact as long as the channel input at any time is independent of the future noise process [17]. A new proof can be conceived by considering the time-incremental channel, for which (136) holds literally. Naturally, the counterpart of the discrete-time result (Theorem 9) in the presence of feedback is also feasible.

One is tempted to also generalize the relationship between the mutual information and smoothing error (Theorem 6) to channels with feedback. Unfortunately, it is not possible to construct a meaningful SNR-incremental channel like (123) in this case, since changing the SNR affects not only the amount of Gaussian noise, but also the statistics of the feedback, and consequently the transmitted signal itself. We give two examples to show that in general the derivative of the mutual information with respect to the SNR has no direct connection to the noncausal MMSE, and, in particular,

$$ \frac{d}{d\mathsf{snr}}\, I\big(X; Y_0^T\big) \ne \frac{1}{2} \int_0^T \mathsf{E}\Big\{\big(Z(t, Y_0^t, X) - \mathsf{E}\{Z(t, Y_0^t, X) \mid Y_0^T;\mathsf{snr}\}\big)^2\Big\}\, dt. \tag{172} $$

Having access to feedback allows the transmitter to determine the SNR as accurately as desired by transmitting known signals and observing the realization of the output for long enough.* Once the SNR is known, one can choose a pathological signaling:

$$ Z(t, Y_0^t, X) = X/\sqrt{\mathsf{snr}}. \tag{173} $$

Clearly the output of channel (170) remains the same regardless of the SNR. Hence the mutual information has zero
derivative, while the MMSE is nonzero. In fact, one can choose 
to encode the SNR in the channel input in such a way that 
the derivative of the mutual information is arbitrary (e.g., 
negative). 

The same conclusion can be drawn from an alternative 
viewpoint by noting that feedback can help to achieve essen- 
tially symbol error-free communication at channel capacity 
by using a signaling specially tailored for the SNR, e.g., 
capacity-achieving error-control codes. More interesting is the 
variable-duration modulation scheme of Turin [58] for the 
infinite-bandwidth continuous-time Gaussian channel, where 
the capacity-achieving input is an explicit deterministic func- 
tion of the message and the feedback. From this scheme, we 
can derive a suboptimal noncausal estimator of the channel 
input by appending the encoder at the output of the decoder.
Since arbitrarily low block error rate can be achieved by the
coding scheme of [58] and the channel input has bounded 
power, the smoothing MMSE achieved by the suboptimal 
noncausal estimator can be made as small as desired. On the 
other hand, achieving channel capacity requires that the mutual 
information be nonnegligible. 

Note that a fundamental proviso for our mutual information- 
MMSE relationship is that the input distribution not be allowed 
to depend on SNR. However, in general, feedback removes 
such restrictions. 

C. Generalization to Vector Models 

Just as Theorem 1 obtained under a scalar model has its counterpart (Theorem 2) under a vector model, all the results in Sections III and IV can be generalized to vector models, under either the discrete- or the continuous-time setting. For example, the vector continuous-time model takes the form of

$$ dY_t = \sqrt{\mathsf{snr}}\, X_t\, dt + dW_t, \tag{174} $$

where $\{W_t\}$ is an $m$-dimensional Wiener process, and $\{X_t\}$ and $\{Y_t\}$ are $m$-dimensional random processes. Theorem 6 holds literally, while the mutual information rate, estimation errors, and power are now defined with respect to the vector signals and their Euclidean norms. In fact, Duncan's Theorem was originally given in vector form [15]. It should be noted that the incremental-channel devices are directly applicable to the vector models.

*The same technique applies to discrete-time channels with feedback. If instead the received signal is in the form of $dY_t = Z(t, Y_0^t, X)\, dt + (1/\sqrt{\mathsf{snr}})\, dW_t$, then the SNR can also be determined by computing the quadratic variation of $Y_t$ during an arbitrarily small interval.

In view of the above generalizations, the discrete- and
continuous-time results in Sections IV-A and IV-B also extend
straightforwardly to vector models.

Furthermore, colored additive Gaussian noise can be treated
by first filtering the observation to whiten the noise and recover
the canonical model of the form (162).

D. Complex-valued Channels 

The results in the discrete-time regime (Theorems [^^5l and 
Corollaries hold verbatim for complex-valued channel 
and signaling if the noise samples are i.i.d. circularly symmet- 
ric complex Gaussian, whose real and imaginary components 
have unit variance. In particular, the factor of 1/2 in M5\ . 
M2\ . M52\ and M56\ remains intact. However, with the more 
common definition of snr in complex-valued channels where 
the complex noise has real and imaginary components with 
variance 1/2 each, the factor of 1/2 in the formulas disappears. 

The above principle holds also under continuous-time mod- 
els as long as the complex-valued Wiener process is appropri- 
ately defined. This is straightforward by noting that in general 
complex-valued models can be regarded as two independent 
uses of the real-valued ones (with possibly correlated inputs 
in the two uses). 
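
As a quick sanity check on the factor of 1/2, consider the following worked example. It assumes a Gaussian input, which the argument above does not require, and uses the standard closed forms for the Gaussian channel:

```latex
% Real-valued channel Y = \sqrt{snr} X + N with X, N ~ N(0,1):
I(\mathsf{snr}) = \tfrac{1}{2}\log(1+\mathsf{snr}), \qquad
\mathsf{mmse}(\mathsf{snr}) = \frac{1}{1+\mathsf{snr}}, \qquad
\frac{d}{d\,\mathsf{snr}} I(\mathsf{snr}) = \tfrac{1}{2}\,\mathsf{mmse}(\mathsf{snr}).

% Complex-valued channel, circularly symmetric X with E|X|^2 = 1,
% noise with variance 1/2 per real dimension:
I(\mathsf{snr}) = \log(1+\mathsf{snr}), \qquad
\mathsf{mmse}(\mathsf{snr}) = \frac{1}{1+\mathsf{snr}}, \qquad
\frac{d}{d\,\mathsf{snr}} I(\mathsf{snr}) = \mathsf{mmse}(\mathsf{snr}).
```

In the second case the factor of 1/2 is absorbed by the two real dimensions, in line with the remark on the common definition of snr for complex-valued channels.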

VI. New Representation of Information Measures 

The relationship between mutual information and MMSE 
enables other information measures such as entropy and di- 
vergence to be expressed as a function of MMSE as well. 

Consider a discrete random variable $X$. Assume $N \sim
\mathcal{N}(0, 1)$ independent of the input throughout this section. The
mutual information between $X$ and its observation through a
Gaussian channel converges to the entropy of $X$ as the SNR
of the channel goes to infinity.

Lemma 6: For every discrete real-valued random variable
$X$,

$$ H(X) = \lim_{\mathsf{snr}\to\infty} I\big(X; \sqrt{\mathsf{snr}}\,X + N\big). \tag{175} $$

Proof: See Appendix VII. ■
Note that if $H(X)$ is infinity then the mutual information in
(175) also increases without bound as $\mathsf{snr} \to \infty$. Moreover, the
result holds if $X$ is subject to an arbitrary one-to-one mapping
$g(\cdot)$ before going through the channel. In view of the mutual
information-MMSE relationship of Theorem 1 and
(175), the following theorem is immediate.

Theorem 13: For any discrete random variable $X$ taking
values in $\mathcal{A}$, the entropy of $X$ is given by (in nats)

$$ H(X) = \frac{1}{2} \int_0^\infty E\Big\{ \big( g(X) - E\{ g(X) \mid \sqrt{\mathsf{snr}}\, g(X) + N \} \big)^2 \Big\} \, d\mathsf{snr} \tag{176} $$

for any one-to-one mapping $g : \mathcal{A} \to \mathbb{R}$.

It is interesting to note that the integral on the right-hand
side of (176) does not depend on the choice of $g(\cdot)$, which is
not evident from estimation-theoretic properties alone.
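
The entropy formula can be checked numerically in a simple case. The sketch below is illustrative only: it assumes an equiprobable two-point input mapped by g to {-a, +a}, for which the MMSE has the well-known tanh form, and all function names are hypothetical. It integrates half the MMSE over the SNR and compares the result with log 2 for two different mappings.

```python
import numpy as np

def trapezoid(f, x):
    """Trapezoidal rule along the last axis (written out to stay NumPy-version neutral)."""
    return np.sum((f[..., 1:] + f[..., :-1]) * np.diff(x), axis=-1) / 2.0

def mmse_two_point(snr, a=1.0):
    """MMSE of X uniform on {-a, +a} observed as sqrt(snr) * X + N, with N ~ N(0, 1)."""
    snr = np.atleast_1d(np.asarray(snr, dtype=float))
    y = np.linspace(-10.0, 10.0, 1001)
    phi = np.exp(-y**2 / 2.0) / np.sqrt(2.0 * np.pi)
    s = (a * a * snr)[:, None]  # effective SNR of the unit-amplitude problem
    integrand = phi * np.tanh(s - np.sqrt(s) * y)
    return a * a * (1.0 - trapezoid(integrand, y))

def entropy_from_mmse(a=1.0, snr_max=60.0, n=3001):
    """Half the integral of the MMSE over [0, snr_max]; the tail beyond is negligible here."""
    snr = np.linspace(0.0, snr_max, n)
    return 0.5 * trapezoid(mmse_two_point(snr, a), snr)

if __name__ == "__main__":
    # Two different one-to-one mappings (a = 1 and a = 1.5) of the same binary variable.
    print(entropy_from_mmse(1.0), entropy_from_mmse(1.5), np.log(2.0))
```

Both integrals should agree with log 2 (in nats) up to discretization error, which illustrates the independence of the choice of g in this special case.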

The "non-Gaussianness" of a random variable (divergence 
between its distribution and a Gaussian distribution with the 
same mean and variance) and, thus, the differential entropy 
can also be written in terms of MMSE. To that end, we need 
the following auxiliary result. 

Lemma 7: Let $X$ be any real-valued random variable and
$X'$ be Gaussian with the same mean and variance as $X$, i.e.,
$X' \sim \mathcal{N}(EX, \sigma_X^2)$. Let $Y$ and $Y'$ be the output of the channel
(3) with $X$ and $X'$ as the input respectively. Then

$$ D(P_X \| P_{X'}) = \lim_{\mathsf{snr}\to\infty} D(P_Y \| P_{Y'}). \tag{177} $$

Proof: By monotone convergence and the fact that data
processing reduces divergence. ■
Note that in case the divergence between $P_X$ and $P_{X'}$ is
infinity, the divergence between $P_Y$ and $P_{Y'}$ also increases
without bound. Since

$$ D(P_Y \| P_{Y'}) = I(X';Y') - I(X;Y), \tag{178} $$

the following result is straightforward by Theorem 1.

Theorem 14: For every random variable $X$ with $\sigma_X^2 < \infty$,
its non-Gaussianness is given by

$$ D_X = D\big(P_X \,\|\, \mathcal{N}(EX, \sigma_X^2)\big) \tag{179} $$

$$ \phantom{D_X} = \frac{1}{2} \int_0^\infty \left[ \frac{\sigma_X^2}{1+\mathsf{snr}\,\sigma_X^2} - \mathsf{mmse}\big(X \mid \sqrt{\mathsf{snr}}\,X + N\big) \right] d\mathsf{snr}. \tag{180} $$

Note that the integrand in (179) is always nonnegative since for
the same variance, Gaussian inputs maximize the MMSE.
Also, Theorem 14 holds even if the divergence is infinity, for
example in the case that $X$ is a discrete random variable. In
light of Theorem 14, the differential entropy of $X$ can be
expressed as:

$$ h(X) = \frac{1}{2}\log\big(2\pi e\,\sigma_X^2\big) - D_X \tag{181} $$

$$ \phantom{h(X)} = \frac{1}{2}\log\big(2\pi e\,\sigma_X^2\big) - \frac{1}{2}\int_0^\infty \left[ \frac{\sigma_X^2}{1+\mathsf{snr}\,\sigma_X^2} - \mathsf{mmse}\big(X \mid \sqrt{\mathsf{snr}}\,X + N\big) \right] d\mathsf{snr}. \tag{182} $$

According to (179), $\gamma_X = e^{-2 D_X}$ is a parameter that measures
the difficulty of estimating $X$ when observed in Gaussian
noise across the full range of SNRs. Note that $0 \le \gamma_X \le 1$
with the upper bound attained when $X$ is Gaussian, and the
lower bound attained when $X$ is discrete. Adding independent
random variables results in a random variable that is harder to
estimate in the sense of the following inequality:

$$ \alpha\,\gamma_{X_1} + (1-\alpha)\,\gamma_{X_2} \le \gamma_{X_1+X_2} \tag{183} $$

where $X_1$ and $X_2$ are independent random variables and $\alpha$ is
the ratio of the variance of $X_1$ to the sum of the variances of
$X_1$ and $X_2$. Of course, (183) is nothing but Shannon's entropy
power inequality [25]. It would be interesting to see if (183)
can be proven from estimation-theoretic principles.
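
The link between the inequality on the difficulty parameter and the entropy power inequality can be spelled out in a few lines. The sketch below takes $\gamma_X = e^{-2D_X}$ and writes $\sigma_i^2$ for the variance of $X_i$; this shorthand is introduced here only for illustration.

```latex
% From h(X) = \tfrac12 \log(2\pi e \sigma_X^2) - D_X and \gamma_X = e^{-2 D_X}:
e^{2h(X)} = 2\pi e\,\sigma_X^2\,\gamma_X .

% Entropy power inequality for independent X_1, X_2:
e^{2h(X_1+X_2)} \ge e^{2h(X_1)} + e^{2h(X_2)}
\;\Longleftrightarrow\;
(\sigma_1^2+\sigma_2^2)\,\gamma_{X_1+X_2} \ge \sigma_1^2\,\gamma_{X_1} + \sigma_2^2\,\gamma_{X_2}.

% Dividing by \sigma_1^2+\sigma_2^2 with \alpha = \sigma_1^2/(\sigma_1^2+\sigma_2^2):
\alpha\,\gamma_{X_1} + (1-\alpha)\,\gamma_{X_2} \le \gamma_{X_1+X_2}.
```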

Another observation is that Theorem 1 provides a new
means of representing the mutual information between an
arbitrary random variable $X$ and a real-valued random variable
$Z$:

$$ I(X;Z) = \frac{1}{2}\int_0^\infty E\Big\{ \big(E\{Z \mid \sqrt{\mathsf{snr}}\,Z + N, X\}\big)^2 - \big(E\{Z \mid \sqrt{\mathsf{snr}}\,Z + N\}\big)^2 \Big\}\, d\mathsf{snr}. \tag{184} $$

An arbitrary discrete-valued $Z$ can be handled as in Theorem
13 by means of an adequate one-to-one mapping.

The above results can be generalized to continuous-time
models and vector channels. It is remarkable that the entropy,
differential entropy, divergence and mutual information in
fairly general settings admit expressions in pure estimation-
theoretic quantities. It remains to be seen whether such repre-
sentations lead to new insights and applications.

VII. Conclusion

This paper reveals that the input-output mutual information
and the (noncausal) MMSE in estimating the input given
the output determine each other by a simple formula under
both discrete- and continuous-time, scalar and vector Gaussian
channel models. A consequence of this relationship is the
coupling of the MMSEs achievable by smoothing and filtering
with arbitrary signals corrupted by Gaussian noise. Moreover,
new expressions in terms of MMSE are found for information
measures such as entropy and divergence.

The idea of incremental channels is the underlying basis
for the most streamlined proof of the main results and for
their interpretation. The white Gaussian nature of the noise
is key to this approach: 1) The sum of independent Gaussian
variates is Gaussian; and 2) the Wiener process has indepen-
dent increments. In fact, the relationship between the mutual
information and the noncausal estimation error holds in even
more general settings of Gaussian channels. In a follow-up
to this work, Zakai has recently extended the mutual
information-MMSE formula to the
abstract Wiener space [44], which generalizes the classical $m$-
dimensional Wiener process.

The incremental-channel technique in this paper is relevant
for an entire family of channels the noise of which has
independent increments, i.e., noise that is characterized by Lévy
processes [59]. A particularly interesting case, which is reported
in [60], is the Poisson channel, where the corresponding
mutual information-estimation error relationship involves an
error measure quite different from mean-square error.

Applications of the relationships revealed in this paper are
abundant. In addition to the application in [30] to multiuser
channels, [38] shows applications to key results in EXIT charts
for the analysis of sparse-graph codes. Other applications
as well as counterparts to non-Gaussian channels will be
published in the near future. In all, the relations shown in this
paper illuminate intimate connections between information
theory and estimation theory.

Appendix I
Verification of (15): Binary Input

Proof: From (17) and (18), it can be checked that

$$ 2\,\frac{d}{d\mathsf{snr}} I(\mathsf{snr}) - \mathsf{mmse}(\mathsf{snr}) = 1 - \int_{-\infty}^{\infty} \frac{e^{-y^2/2}}{\sqrt{2\pi}} \left(1 - \frac{y}{\sqrt{\mathsf{snr}}}\right) \tanh\big(\mathsf{snr} - \sqrt{\mathsf{snr}}\,y\big)\, dy \tag{185} $$

$$ = 1 - \frac{1}{\sqrt{\mathsf{snr}}} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(z-\sqrt{\mathsf{snr}})^2}{2}}\, z \tanh\big(\sqrt{\mathsf{snr}}\,z\big)\, dz, \tag{186} $$

where from (185) to (186) $\sqrt{\mathsf{snr}} - y$ is replaced by $z$.
The integral in (186) can be regarded as the expectation of
$Z \tanh(\sqrt{\mathsf{snr}}\,Z)$ where $Z \sim \mathcal{N}(\sqrt{\mathsf{snr}}, 1)$. The expectation
remains the same if $Z$ is replaced by $Z' \sim \mathcal{N}(-\sqrt{\mathsf{snr}}, 1)$ due
to symmetry. Hence the integral can be rewritten by averaging
over the two cases as:

$$ \frac{1}{2}\int_{-\infty}^{\infty} \left[ e^{-\frac{(z-\sqrt{\mathsf{snr}})^2}{2}} + e^{-\frac{(z+\sqrt{\mathsf{snr}})^2}{2}} \right] z \tanh\big(\sqrt{\mathsf{snr}}\,z\big)\, \frac{dz}{\sqrt{2\pi}} \tag{187} $$

$$ = \frac{1}{2}\int_{-\infty}^{\infty} \left[ e^{-\frac{(z-\sqrt{\mathsf{snr}})^2}{2}} - e^{-\frac{(z+\sqrt{\mathsf{snr}})^2}{2}} \right] z\, \frac{dz}{\sqrt{2\pi}} \tag{188} $$

$$ = \frac{1}{2}\,(EZ - EZ') = \sqrt{\mathsf{snr}}. \tag{189} $$

Therefore, (186) vanishes by (189), and (15) holds.
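
The identity verified in this appendix can also be checked numerically. The sketch below is an illustration rather than part of the proof. It assumes the standard closed forms for equiprobable X = ±1, namely I(snr) = snr − ∫φ(y) log cosh(snr − √snr y) dy and mmse(snr) = 1 − ∫φ(y) tanh(snr − √snr y) dy, and approximates the derivative by a central difference; the function names are hypothetical.

```python
import numpy as np

# Integration grid for the standard Gaussian density phi(y).
Y = np.linspace(-12.0, 12.0, 4801)
PHI = np.exp(-Y**2 / 2.0) / np.sqrt(2.0 * np.pi)
DY = Y[1] - Y[0]

def gauss_expect(values):
    """Trapezoidal approximation of the integral of phi(y) * values(y) dy."""
    f = PHI * values
    return DY * (np.sum(f) - 0.5 * (f[0] + f[-1]))

def mutual_info(snr):
    """I(snr) in nats for equiprobable X = +/-1 (assumed closed form)."""
    arg = snr - np.sqrt(snr) * Y
    log_cosh = np.logaddexp(arg, -arg) - np.log(2.0)  # numerically stable log(cosh(arg))
    return snr - gauss_expect(log_cosh)

def mmse(snr):
    """mmse(snr) for equiprobable X = +/-1 (assumed closed form)."""
    arg = snr - np.sqrt(snr) * Y
    return 1.0 - gauss_expect(np.tanh(arg))

def residual(snr, h=1e-4):
    """2 dI/dsnr - mmse(snr), with the derivative taken by a central difference."""
    d_info = (mutual_info(snr + h) - mutual_info(snr - h)) / (2.0 * h)
    return 2.0 * d_info - mmse(snr)

if __name__ == "__main__":
    for s in (0.5, 1.0, 4.0):
        print(s, residual(s))
```

The printed residuals should be zero up to finite-difference and quadrature error.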



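Appendix II
Proof of Lemma 1

Proof: By (64), the mutual information admits the
following decomposition:

$$ I(Y;Z) = D\big(P_{Y|Z} \,\|\, P_{Y'} \,\big|\, P_Z\big) - D\big(P_Y \,\|\, P_{Y'}\big) \tag{190} $$

where $Y' \sim \mathcal{N}(EY, \sigma_Y^2)$. Let the variance of $Z$ be denoted
by $v$. The first term on the right-hand side of (190) is equal
to a divergence between two Gaussian distributions, which is
found as

$$ \frac{1}{2}\log(1+\delta v) = \frac{\delta v}{2} + o(\delta) \tag{191} $$

by using the general formula:

$$ D\big(\mathcal{N}(m_1,\sigma_1^2) \,\|\, \mathcal{N}(m_0,\sigma_0^2)\big) = \frac{1}{2}\log\frac{\sigma_0^2}{\sigma_1^2} + \frac{1}{2}\left( \frac{(m_1-m_0)^2}{\sigma_0^2} + \frac{\sigma_1^2}{\sigma_0^2} - 1 \right)\log e. \tag{192} $$

It suffices then to show that

$$ D\big(P_Y \,\|\, P_{Y'}\big) = E\left\{ \log\frac{p_Y(Y)}{p_{Y'}(Y)} \right\} = o(\delta), \tag{193} $$

which is straightforward to check by plugging in the density
functions:

$$ \log\frac{p_Y(y)}{p_{Y'}(y)} = \log\left\{ \frac{1}{\sqrt{2\pi}}\, E\exp\left[ -\frac{(y-\sqrt{\delta}\,Z)^2}{2} \right] \right\} - \log\left\{ \frac{1}{\sqrt{2\pi(\delta v+1)}} \exp\left[ -\frac{(y-\sqrt{\delta}\,EZ)^2}{2(\delta v+1)} \right] \right\} \tag{194} $$

$$ = \log E\left\{ \exp\left[ \frac{(y-\sqrt{\delta}\,EZ)^2}{2(\delta v+1)} - \frac{(y-\sqrt{\delta}\,Z)^2}{2} \right] \right\} + \frac{1}{2}\log(1+\delta v) \tag{195} $$

$$ = \log E\left\{ 1 + \sqrt{\delta}\,y\,(Z-EZ) + \frac{\delta}{2}\Big( y^2 (Z-EZ)^2 - v y^2 - Z^2 + (EZ)^2 \Big) + o(\delta) \right\} + \frac{1}{2}\log(1+\delta v) \tag{196} $$

$$ = \log\left(1 - \frac{\delta v}{2}\right) + \frac{1}{2}\log(1+\delta v) + o(\delta) \tag{197} $$

$$ = o(\delta). \tag{198} $$

The limit and the expectation can be exchanged to obtain
(197) as long as $EZ^2 < \infty$ due to the Lebesgue Convergence
Theorem. ■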
It is interesting to note that the proof relies on the fact that 
the divergence between the output distributions of a Gaussian 
channel under different input distributions is sublinear in the 
SNR when the noise dominates. 
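
The low-SNR behavior established by the lemma can be illustrated numerically. The sketch below is not part of the proof: it picks an arbitrary three-point distribution for Z (a hypothetical choice) and computes the mutual information of the channel with output √δ Z + U, where U is standard Gaussian, as h(Y) − h(U) by quadrature. It then compares the result with δ/2 times the variance of Z.

```python
import numpy as np

def mutual_info_discrete(values, probs, delta):
    """I(Z; sqrt(delta) * Z + U) in nats for discrete Z and U ~ N(0, 1)."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    y = np.linspace(-12.0, 12.0, 9601)
    dy = y[1] - y[0]
    # Output density p_Y(y) = sum_k P(Z = z_k) * phi(y - sqrt(delta) * z_k).
    shifted = y[None, :] - np.sqrt(delta) * values[:, None]
    p_y = (probs[:, None] * np.exp(-shifted**2 / 2.0)).sum(axis=0) / np.sqrt(2.0 * np.pi)
    f = -p_y * np.log(p_y)
    h_y = dy * (np.sum(f) - 0.5 * (f[0] + f[-1]))  # differential entropy of Y
    h_u = 0.5 * np.log(2.0 * np.pi * np.e)         # differential entropy of U
    return h_y - h_u

def low_snr_ratio(delta=1e-3):
    """Ratio of I(Y; Z) to (delta / 2) * Var(Z); should approach 1 as delta -> 0."""
    values = np.array([-1.0, 0.0, 2.0])   # hypothetical example input
    probs = np.array([0.5, 0.3, 0.2])
    var = np.sum(probs * values**2) - np.sum(probs * values)**2
    return mutual_info_discrete(values, probs, delta) / (0.5 * delta * var)

if __name__ == "__main__":
    for d in (1e-1, 1e-2, 1e-3):
        print(d, low_snr_ratio(d))
```

The ratio moves toward 1 as δ decreases, in line with the o(δ) remainder.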

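Appendix III
A Fourth Proof of Theorem 1

Proof: For simplicity, it is assumed that the order
of expectation and derivative can be exchanged freely. A
rigorous proof is relegated to Appendix IV where every such
assumption is validated in the more general vector model.

Let $q_i(y;\mathsf{snr})$ be defined as in (87). It can be checked that
for all $i$,

$$ \frac{\partial}{\partial\mathsf{snr}}\, q_i(y;\mathsf{snr}) = \frac{1}{2\sqrt{\mathsf{snr}}}\, y\, q_{i+1}(y;\mathsf{snr}) - \frac{1}{2}\, q_{i+2}(y;\mathsf{snr}) \tag{199} $$

$$ = -\frac{1}{2\sqrt{\mathsf{snr}}}\, \frac{d}{dy}\, q_{i+1}(y;\mathsf{snr}). \tag{200} $$

The derivative of the mutual information, expressed as (89),
can be obtained as

$$ \frac{d}{d\mathsf{snr}} I(\mathsf{snr}) = -\int \big[\log q_0(y;\mathsf{snr}) + 1\big]\, \frac{\partial}{\partial\mathsf{snr}}\, q_0(y;\mathsf{snr})\, dy \tag{201} $$

$$ = \frac{1}{2\sqrt{\mathsf{snr}}} \int \log q_0(y;\mathsf{snr})\, \frac{d}{dy}\, q_1(y;\mathsf{snr})\, dy \tag{202} $$

$$ = -\frac{1}{2\sqrt{\mathsf{snr}}} \int \frac{q_1(y;\mathsf{snr})}{q_0(y;\mathsf{snr})}\, \frac{d}{dy}\, q_0(y;\mathsf{snr})\, dy \tag{203} $$

$$ = \frac{1}{2\sqrt{\mathsf{snr}}} \int q_1(y;\mathsf{snr}) \left[ y - \sqrt{\mathsf{snr}}\, \frac{q_1(y;\mathsf{snr})}{q_0(y;\mathsf{snr})} \right] dy \tag{204} $$

where (203) follows by integrating by parts. Noting that the
fraction in (204) is exactly the conditional mean estimate
$E\{X \mid Y = y;\mathsf{snr}\}$,

$$ \frac{d}{d\mathsf{snr}} I(\mathsf{snr}) = \frac{1}{2\sqrt{\mathsf{snr}}}\, E\Big\{ E\{X \mid Y;\mathsf{snr}\}\, \big[ Y - \sqrt{\mathsf{snr}}\, E\{X \mid Y;\mathsf{snr}\} \big] \Big\} \tag{205} $$

$$ = \frac{1}{2}\, E\left\{ \frac{XY}{\sqrt{\mathsf{snr}}} - \big(E\{X \mid Y;\mathsf{snr}\}\big)^2 \right\} \tag{206} $$

$$ = \frac{1}{2}\, E\Big\{ \big(X - E\{X \mid Y;\mathsf{snr}\}\big)^2 \Big\} \tag{207} $$

$$ = \frac{1}{2}\, \mathsf{mmse}(\mathsf{snr}). \tag{208} $$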
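
The two identities for the SNR derivative of q_i that drive this proof can be spot-checked numerically. The sketch below assumes that q_i(y; snr) denotes E{X^i p(y|X; snr)}, which is how definition (87) is read here, and it specializes to equiprobable X = ±1 purely for illustration. Derivatives are approximated by central differences, and the function names are hypothetical.

```python
import numpy as np

def phi(u):
    """Standard Gaussian density."""
    return np.exp(-u**2 / 2.0) / np.sqrt(2.0 * np.pi)

def q(i, y, snr):
    """q_i(y; snr) = E{X^i p(y | X; snr)} for equiprobable X = +/-1 (assumed definition)."""
    r = np.sqrt(snr)
    return 0.5 * (phi(y - r) + (-1.0) ** i * phi(y + r))

def check(i, snr, h=1e-5):
    """Max deviations of d q_i / d snr from the two closed-form right-hand sides."""
    y = np.linspace(-6.0, 6.0, 241)
    lhs = (q(i, y, snr + h) - q(i, y, snr - h)) / (2.0 * h)
    # First form: y q_{i+1} / (2 sqrt(snr)) - q_{i+2} / 2.
    rhs_first = y * q(i + 1, y, snr) / (2.0 * np.sqrt(snr)) - 0.5 * q(i + 2, y, snr)
    # Second form: -(1 / (2 sqrt(snr))) d q_{i+1} / dy.
    dq_dy = (q(i + 1, y + h, snr) - q(i + 1, y - h, snr)) / (2.0 * h)
    rhs_second = -dq_dy / (2.0 * np.sqrt(snr))
    return np.max(np.abs(lhs - rhs_first)), np.max(np.abs(lhs - rhs_second))

if __name__ == "__main__":
    for i in (0, 1, 2):
        print(i, check(i, 2.0))
```

Both deviations should be at the level of finite-difference error for every i.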



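Appendix IV
Proof of Theorem 2

Proof: It suffices to prove the theorem assuming $H = I$
since one can always regard $HX$ as the input. The vector
channel (19) has a Gaussian conditional density (20). The
unconditional density of the channel output is given by (55),
which is strictly positive for all $y$. The mutual information can
be written as (cf. (89))

$$ I(\mathsf{snr}) = -\frac{L}{2}\log(2\pi e) - \int p_{Y;\mathsf{snr}}(y;\mathsf{snr}) \log p_{Y;\mathsf{snr}}(y;\mathsf{snr})\, dy. \tag{209} $$

Hence,

$$ \frac{d}{d\mathsf{snr}} I(\mathsf{snr}) = -\int \big[\log p_{Y;\mathsf{snr}}(y;\mathsf{snr}) + 1\big]\, \frac{\partial}{\partial\mathsf{snr}}\, p_{Y;\mathsf{snr}}(y;\mathsf{snr})\, dy \tag{210} $$

$$ = -\int \big[\log p_{Y;\mathsf{snr}}(y;\mathsf{snr}) + 1\big]\, E\left\{ \frac{\partial}{\partial\mathsf{snr}}\, p_{Y|X;\mathsf{snr}}(y|X;\mathsf{snr}) \right\} dy \tag{211} $$

where the derivative penetrates the integral in (210) by the
Lebesgue Convergence Theorem, and the order of taking the
derivative and expectation in (211) can be exchanged by
Lemma 8, which is shown below in this Appendix. It can
be checked that (cf. (199) and (200))

$$ \frac{\partial}{\partial\mathsf{snr}}\, p_{Y|X;\mathsf{snr}}(y|x;\mathsf{snr}) = \frac{1}{2\sqrt{\mathsf{snr}}}\, x^\top \big(y - \sqrt{\mathsf{snr}}\,x\big)\, p_{Y|X;\mathsf{snr}}(y|x;\mathsf{snr}) \tag{212} $$

$$ = -\frac{1}{2\sqrt{\mathsf{snr}}}\, x^\top \nabla p_{Y|X;\mathsf{snr}}(y|x;\mathsf{snr}). \tag{213} $$

Using (213), the right-hand side of (211) can be written as

$$ \frac{1}{2\sqrt{\mathsf{snr}}}\, E\left\{ X^\top \int \big[\log p_{Y;\mathsf{snr}}(y;\mathsf{snr}) + 1\big]\, \nabla p_{Y|X;\mathsf{snr}}(y|X;\mathsf{snr})\, dy \right\}. \tag{214} $$

The integral in (214) can be carried out by parts to obtain

$$ -\int p_{Y|X;\mathsf{snr}}(y|X;\mathsf{snr})\, \nabla \big[\log p_{Y;\mathsf{snr}}(y;\mathsf{snr}) + 1\big]\, dy, \tag{215} $$

since for all $x$, as $\|y\| \to \infty$,

$$ p_{Y|X;\mathsf{snr}}(y|x;\mathsf{snr})\, \big[\log p_{Y;\mathsf{snr}}(y;\mathsf{snr}) + 1\big] \to 0. \tag{216} $$

Hence, the expectation in (214) can be further evaluated as

$$ -\int \frac{E\big\{ X^\top p_{Y|X;\mathsf{snr}}(y|X;\mathsf{snr}) \big\}}{p_{Y;\mathsf{snr}}(y;\mathsf{snr})}\, \nabla p_{Y;\mathsf{snr}}(y;\mathsf{snr})\, dy \tag{217} $$

where we have changed the order of the expectation with
respect to $X$ and the integral (i.e., expectation with respect to
$Y$). By (213) and Lemma 9 (shown below in this Appendix),
(217) can be written as

$$ \int E\big\{ X^\top \mid Y = y;\mathsf{snr} \big\}\, E\big\{ (y - \sqrt{\mathsf{snr}}\,X)\, p_{Y|X;\mathsf{snr}}(y|X;\mathsf{snr}) \big\}\, dy. \tag{218} $$

Therefore, (211) can be rewritten as

$$ \frac{d}{d\mathsf{snr}} I(\mathsf{snr}) = \frac{1}{2\sqrt{\mathsf{snr}}} \int E\big\{ X^\top \mid Y = y;\mathsf{snr} \big\}\, E\big\{ y - \sqrt{\mathsf{snr}}\,X \mid Y = y;\mathsf{snr} \big\}\, p_{Y;\mathsf{snr}}(y;\mathsf{snr})\, dy \tag{219} $$

$$ = \frac{1}{2\sqrt{\mathsf{snr}}}\, E\Big\{ E\{X^\top \mid Y;\mathsf{snr}\}\, E\big\{ Y - \sqrt{\mathsf{snr}}\,X \mid Y;\mathsf{snr} \big\} \Big\} \tag{220} $$

$$ = E\left\{ \frac{1}{2}\|X\|^2 - \frac{1}{2}\big\| E\{X \mid Y;\mathsf{snr}\} \big\|^2 \right\} \tag{221} $$

$$ = \frac{1}{2}\, E\Big\{ \big\| X - E\{X \mid Y;\mathsf{snr}\} \big\|^2 \Big\}. \tag{222} $$

Hence the proof of Theorem 2. ■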
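The following two lemmas were needed to justify the
exchange of derivatives and expectation with respect to $P_X$ in
the above proof.

Lemma 8: If $E\|X\|^2 < \infty$, then

$$ \frac{\partial}{\partial\mathsf{snr}}\, E\big\{ p_{Y|X;\mathsf{snr}}(y|X;\mathsf{snr}) \big\} = E\left\{ \frac{\partial}{\partial\mathsf{snr}}\, p_{Y|X;\mathsf{snr}}(y|X;\mathsf{snr}) \right\}. \tag{223} $$

Proof: Let

$$ f_\delta(x,y,\mathsf{snr}) = \frac{1}{\delta}\big[ p_{Y|X;\mathsf{snr}}(y|x;\mathsf{snr}+\delta) - p_{Y|X;\mathsf{snr}}(y|x;\mathsf{snr}) \big] \tag{224} $$

and

$$ f(x,y,\mathsf{snr}) = \frac{\partial}{\partial\mathsf{snr}}\, p_{Y|X;\mathsf{snr}}(y|x;\mathsf{snr}). \tag{225} $$

Then, $\forall x, y, \mathsf{snr}$, $f_\delta(x,y,\mathsf{snr}) \to f(x,y,\mathsf{snr})$ as $\delta \to 0$.
Lemma 8 is equivalent to

$$ \lim_{\delta\to 0} \int f_\delta(x,y,\mathsf{snr})\, P_X(dx) = \int f(x,y,\mathsf{snr})\, P_X(dx). \tag{226} $$

Suppose we can show that for every $\delta, x, y$ and $\mathsf{snr}$,

$$ |f_\delta(x,y,\mathsf{snr})| < \|x\|^2 + \frac{1}{\sqrt{\mathsf{snr}}}\, |y^\top x|. \tag{227} $$

Then (226) holds by the Lebesgue Convergence Theorem since
the right-hand side of (227) is integrable with respect to $P_X$
by the assumption in the lemma. Note that

$$ f_\delta(x,y,\mathsf{snr}) = (2\pi)^{-\frac{L}{2}}\, \frac{1}{\delta} \left\{ \exp\left[ -\frac{1}{2}\big\| y - \sqrt{\mathsf{snr}+\delta}\,x \big\|^2 \right] - \exp\left[ -\frac{1}{2}\big\| y - \sqrt{\mathsf{snr}}\,x \big\|^2 \right] \right\}. \tag{228} $$

If

$$ \frac{1}{\delta} \le \|x\|^2 + \frac{1}{\sqrt{\mathsf{snr}}}\, |y^\top x|, \tag{229} $$

then (227) holds trivially. Otherwise,

$$ |f_\delta(x,y,\mathsf{snr})| < \frac{1}{2\delta} \left| \exp\left[ \frac{1}{2}\big\| y - \sqrt{\mathsf{snr}}\,x \big\|^2 - \frac{1}{2}\big\| y - \sqrt{\mathsf{snr}+\delta}\,x \big\|^2 \right] - 1 \right| \tag{230} $$

$$ \le \frac{1}{2\delta} \left( \exp\left| \frac{1}{2}\Big( \delta \|x\|^2 - 2\big(\sqrt{\mathsf{snr}+\delta} - \sqrt{\mathsf{snr}}\big)\, y^\top x \Big) \right| - 1 \right) \tag{231} $$

$$ \le \frac{1}{2\delta} \left( \exp\left[ \frac{\delta}{2}\left( \|x\|^2 + \frac{1}{\sqrt{\mathsf{snr}}}\, |y^\top x| \right) \right] - 1 \right). \tag{232} $$

The inequality (227) holds for all $x, y, \mathsf{snr}$ due to the fact that

$$ e^t - 1 < 2t, \quad \forall\, 0 \le t < 1. \tag{233} $$

■

Lemma 9: If $EX$ exists, then for $i = 1, \ldots, L$,

$$ \frac{\partial}{\partial y_i}\, E\big\{ p_{Y|X;\mathsf{snr}}(y|X;\mathsf{snr}) \big\} = E\left\{ \frac{\partial}{\partial y_i}\, p_{Y|X;\mathsf{snr}}(y|X;\mathsf{snr}) \right\}. \tag{234} $$

Proof: The proof is similar to that for Lemma 8. Let

$$ g_\delta(x,y,\mathsf{snr}) = \frac{1}{\delta}\big[ p_{Y|X;\mathsf{snr}}(y + \delta e_i|x;\mathsf{snr}) - p_{Y|X;\mathsf{snr}}(y|x;\mathsf{snr}) \big] \tag{235} $$

where $e_i$ is a vector with all zeros except on the $i$th entry, which
is 1. Then, $\forall x, y, \mathsf{snr}$,

$$ \lim_{\delta\to 0} g_\delta(x,y,\mathsf{snr}) = \frac{\partial}{\partial y_i}\, p_{Y|X;\mathsf{snr}}(y|x;\mathsf{snr}). \tag{236} $$

We show that

$$ |g_\delta(x,y,\mathsf{snr})| < |y_i| + 1 + \sqrt{\mathsf{snr}}\,|x_i| \tag{237} $$

so that (234) holds by the Lebesgue Convergence Theorem
(cf. (226)). Note that

$$ g_\delta(x,y,\mathsf{snr}) = (2\pi)^{-\frac{L}{2}}\, \frac{1}{\delta} \left\{ \exp\left[ -\frac{1}{2}\big\| y + \delta e_i - \sqrt{\mathsf{snr}}\,x \big\|^2 \right] - \exp\left[ -\frac{1}{2}\big\| y - \sqrt{\mathsf{snr}}\,x \big\|^2 \right] \right\}. \tag{238} $$

If

$$ \frac{1}{\delta} \le |y_i| + 1 + \sqrt{\mathsf{snr}}\,|x_i|, \tag{239} $$

then (237) holds trivially. Otherwise,

$$ |g_\delta(x,y,\mathsf{snr})| < \frac{1}{2\delta} \left| \exp\left[ \frac{1}{2}\big\| y - \sqrt{\mathsf{snr}}\,x \big\|^2 - \frac{1}{2}\big\| y + \delta e_i - \sqrt{\mathsf{snr}}\,x \big\|^2 \right] - 1 \right| \tag{240} $$

$$ = \frac{1}{2\delta} \left| \exp\left[ -\frac{\delta}{2}\big( 2y_i + \delta - 2\sqrt{\mathsf{snr}}\,x_i \big) \right] - 1 \right|, \tag{241} $$

and (237) holds by (233). ■
Appendix V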
Appendix V 

Verification of (111): Random Telegraph Input
Let £ = and define 

^ snr 

/oo 
i)ie«" dw. (242) 

It can be checked that 

= /(i + 2,j)-/(^J + 2), (243) 
= /(« + 2,j), (244) 

-Uihj) = ^/(*-2,j) + |/(*,J- 2), (245) 

where verifying ( I245l l entails integration by parts. Then (^Oj 
can be rewritten as 

cmmse(snr) = /(-I, -1) (246) 

and hence 

-^[snr-cmmse(snr)] = [/(-I, -1) ^^47) 
- -1) + ^/(-l, -l)/(3, -l)]/f{l, -1). 

With the change of variables t = {1 — x^)^^ and u = (1 — 



mmse(snr) = / ^(1,-1) 



1 t+u-1 



(248) 



The denominator in ( I248t prevents the double integral from 
being separated. This can be circumvented by taking derivative 
with respect to ^. Noting that 

e« [e-«/'(l, -1) mmse(snr)] = f{l, -1), (249) 

the identity ( II 1 1> is equivalent to 

+ f/(-l,-l)/(3,-l))] =/'(!, -1) 

since both sides of dl 1 1> tend to as ^ ^ —00. With the help 
of ( I243t -( l245t . verifying (125 Ot is a matter of algebra. 



Appendix VI
Proof of Lemma 5

Lemma 5 can be regarded as a consequence of Duncan's
Theorem. The mutual information can be expressed as a time-
integral of the causal MMSE:

$$ I\big(Z_0^T; Y_0^T\big) = \frac{\delta}{2} \int_0^T E\big( Z_t - E\{Z_t \mid Y_0^t;\delta\} \big)^2\, dt. \tag{251} $$

As the SNR $\delta \to 0$, the observation becomes inconsequen-
tial in estimating the input signal. Indeed, the causal MMSE
estimate converges to the unconditional mean in mean-square
sense:

$$ E\{Z_t \mid Y_0^t;\delta\} \to E Z_t. \tag{252} $$

Putting (251) and (252) together proves Lemma 5.

In parallel with the proof of Lemma 1, another reasoning
of Lemma 5 from first principles without invoking Duncan's
Theorem is presented in the following. In fact, Lemma 5 is
established first in this way so that a more intuitive proof of
Duncan's Theorem is given in Section III-D using the idea of
time-incremental channels.

Proof: [Lemma 5] By definition (98), the mutual informa-
tion is the expectation of the logarithm of the Radon-Nikodym
derivative (99), which can be obtained by the chain rule as

$$ \Phi = \frac{d\mu_{YZ}}{d\mu_Y\, d\mu_Z} = \frac{d\mu_{YZ}}{d\mu_{WZ}} \left( \frac{d\mu_Y}{d\mu_W} \right)^{-1}. \tag{253} $$

First assume that $\{Z_t\}$ is a bounded uniformly stepwise
process, i.e., there exists a finite subdivision of $[0,T]$, $0 = t_0
< t_1 < \cdots < t_n = T$, and a finite constant $M$ such that

$$ Z_t(\omega) = Z_{t_i}(\omega), \quad t \in [t_i, t_{i+1}), \quad i = 0, \ldots, n-1, \tag{254} $$

and $|Z_t(\omega)| < M$, $\forall t \in [0,T]$. Let $Z = [Z_{t_0}, \ldots, Z_{t_n}]$,
$Y = [Y_{t_0}, \ldots, Y_{t_n}]$, and $W = [W_{t_0}, \ldots, W_{t_n}]$ be $(n+1)$-
dimensional vectors formed by the samples of the random
processes. Then, the input-output conditional density is Gaus-
sian:

$$ p_{Y|Z}(y|z) = \prod_{i=0}^{n-1} \frac{1}{\sqrt{2\pi(t_{i+1}-t_i)}} \exp\left[ -\frac{\big( y_{i+1} - y_i - \sqrt{\delta}\, z_i (t_{i+1}-t_i) \big)^2}{2(t_{i+1}-t_i)} \right]. \tag{255} $$

Easily,

$$ \frac{p_{YZ}(b,z)}{p_{WZ}(b,z)} = \frac{p_{Y|Z}(b|z)}{p_W(b)} \tag{256} $$

$$ = \exp\left[ \sqrt{\delta} \sum_{i=0}^{n-1} z_i (b_{i+1}-b_i) - \frac{\delta}{2} \sum_{i=0}^{n-1} z_i^2 (t_{i+1}-t_i) \right]. \tag{257} $$

Thus the Radon-Nikodym derivative can be established as

$$ \frac{d\mu_{YZ}}{d\mu_{WZ}} = \exp\left[ \sqrt{\delta} \int_0^T Z_t\, dW_t - \frac{\delta}{2} \int_0^T Z_t^2\, dt \right] \tag{258} $$

using the finite-dimensional likelihood ratios (257). It is clear
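that $\mu_{YZ} \ll \mu_{WZ}$.

[Footnote: A shortcut to the proof of (258) is by the Girsanov Theorem [47].]

For the case of a general finite-power process (not nec-
essarily bounded) $\{Z_t\}$, a sequence of bounded uniformly
stepwise processes which converge to $\{Z_t\}$ in $L^2(dt\,dP)$
can be obtained. The Radon-Nikodym derivative (258) of the
sequence of processes also converges. Absolute continuity
is preserved. Therefore, (258) holds for all such processes.

The derivative (258) can be rewritten as

$$ \frac{d\mu_{YZ}}{d\mu_{WZ}} = 1 + \sqrt{\delta} \int_0^T Z_t\, dW_t + \frac{\delta}{2} \left[ \left( \int_0^T Z_t\, dW_t \right)^2 - \int_0^T Z_t^2\, dt \right] + o(\delta). \tag{259} $$

By the independence of the processes $\{W_t\}$ and $\{Z_t\}$, the
measure $\mu_{WZ} = \mu_W \mu_Z$. Thus integrating on the measure $\mu_Z$
gives

$$ \frac{d\mu_Y}{d\mu_W} = 1 + \sqrt{\delta} \int_0^T E Z_t\, dW_t + \frac{\delta}{2} \left[ E_{\mu_Z} \left( \int_0^T Z_t\, dW_t \right)^2 - \int_0^T E Z_t^2\, dt \right] + o(\delta). \tag{260} $$

Using (259), (260) and the chain rule (253), the Radon-
Nikodym derivative $\Phi$ exists and is given by

$$ \Phi = 1 + \sqrt{\delta} \int_0^T (Z_t - E Z_t)\, dW_t + \frac{\delta}{2} \left[ \left( \int_0^T Z_t\, dW_t \right)^2 - E_{\mu_Z} \left( \int_0^T Z_t\, dW_t \right)^2 - \int_0^T Z_t^2\, dt + \int_0^T E Z_t^2\, dt - 2 \int_0^T E Z_t\, dW_t \int_0^T (Z_t - E Z_t)\, dW_t \right] + o(\delta) \tag{261} $$

$$ = 1 + \sqrt{\delta} \int_0^T (Z_t - E Z_t)\, dW_t + \frac{\delta}{2} \left[ \left( \int_0^T (Z_t - E Z_t)\, dW_t \right)^2 - E_{\mu_Z} \left( \int_0^T (Z_t - E Z_t)\, dW_t \right)^2 - \int_0^T \big( Z_t^2 - E Z_t^2 \big)\, dt \right] + o(\delta). \tag{262} $$

Note that the mutual information is an expectation with respect
to the measure $\mu_{YZ}$. It can be written as

$$ \int \log \Phi'\, d\mu_{YZ} \tag{263} $$

where $\Phi'$ is obtained from $\Phi$ (262) by substituting all occur-
rences of $dW_t$ by $dY_t = \sqrt{\delta}\, Z_t\, dt + dW_t$:

$$ \Phi' = 1 + \sqrt{\delta} \int_0^T (Z_t - E Z_t)\, dY_t + \frac{\delta}{2} \left[ \left( \int_0^T (Z_t - E Z_t)\, dY_t \right)^2 - E_{\mu_Z} \left( \int_0^T (Z_t - E Z_t)\, dY_t \right)^2 - \int_0^T \big( Z_t^2 - E Z_t^2 \big)\, dt \right] + o(\delta) \tag{264} $$

$$ = 1 + \sqrt{\delta} \int_0^T (Z_t - E Z_t)\, dW_t + \frac{\delta}{2} \left[ \left( \int_0^T (Z_t - E Z_t)\, dW_t \right)^2 - E_{\mu_Z} \left( \int_0^T (Z_t - E Z_t)\, dW_t \right)^2 + \int_0^T (Z_t - E Z_t)^2\, dt + \int_0^T E (Z_t - E Z_t)^2\, dt \right] + o(\delta) \tag{265} $$

$$ = 1 + \sqrt{\delta} \int_0^T \tilde{Z}_t\, dW_t + \frac{\delta}{2} \left[ \left( \int_0^T \tilde{Z}_t\, dW_t \right)^2 - E_{\mu_Z} \left( \int_0^T \tilde{Z}_t\, dW_t \right)^2 + \int_0^T \tilde{Z}_t^2\, dt + \int_0^T E \tilde{Z}_t^2\, dt \right] + o(\delta) \tag{266} $$

where $\tilde{Z}_t = Z_t - E Z_t$. Hence

$$ \log \Phi' = \sqrt{\delta} \int_0^T \tilde{Z}_t\, dW_t + \frac{\delta}{2} \left[ - E_{\mu_Z} \left( \int_0^T \tilde{Z}_t\, dW_t \right)^2 + \int_0^T \tilde{Z}_t^2\, dt + \int_0^T E \tilde{Z}_t^2\, dt \right] + o(\delta). \tag{267} $$

Therefore, the mutual information is

$$ E \log \Phi' = \frac{\delta}{2} \left[ 2 \int_0^T E \tilde{Z}_t^2\, dt - E \left( \int_0^T \tilde{Z}_t\, dW_t \right)^2 \right] + o(\delta) \tag{268} $$

$$ = \frac{\delta}{2} \left[ 2 \int_0^T E \tilde{Z}_t^2\, dt - \int_0^T E \tilde{Z}_t^2\, dt \right] + o(\delta) \tag{269} $$

$$ = \frac{\delta}{2} \int_0^T E \tilde{Z}_t^2\, dt + o(\delta), \tag{270} $$

and the lemma is proved. ■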

Appendix VII
Proof of Lemma 6

Proof: Let $Y = \sqrt{\mathsf{snr}}\, g(X) + N$. Since

$$ 0 \le H(X) - I(X;Y) = H(X|Y), \tag{271} $$

it suffices to show that the uncertainty about $X$ given $Y$
vanishes as $\mathsf{snr} \to \infty$:

$$ \lim_{\mathsf{snr}\to\infty} H(X|Y) = 0. \tag{272} $$

Assume first that $X$ takes a finite number ($m < \infty$) of
distinct values. Given $Y$, let $\hat{X}$ be the decision for $X$ that
achieves the minimum probability of error, which is denoted
by $p$. Then

$$ H(X|Y) \le H(X|\hat{X}) \le p \log(m-1) + H_2(p), \tag{273} $$

where $H_2(\cdot)$ stands for the binary entropy function, and the
second inequality is due to Fano [29]. Since $p \to 0$ as $\mathsf{snr} \to
\infty$, the right-hand side of (273) vanishes and (272) is proved.

In case $X$ takes a countable number of values and
$H(X) < \infty$, for every natural number $m$, let $U_m$ be an
indicator which takes the value of 1 if $X$ takes one of the $m$
most likely values and 0 otherwise. Let $\hat{X}_m$ be the function
of $Y$ which minimizes $P\{X \ne \hat{X}_m \mid U_m = 1\}$. Then for every
$m$,

$$ H(X|Y) \le H(X|\hat{X}_m) \tag{274} $$

$$ = H(X, U_m|\hat{X}_m) \tag{275} $$

$$ = H(X|\hat{X}_m, U_m) + H(U_m|\hat{X}_m) \tag{276} $$

$$ \le P\{U_m = 1\}\, H(X|\hat{X}_m, U_m = 1) + P\{U_m = 0\}\, H(X|\hat{X}_m, U_m = 0) + H(U_m) \tag{277} $$

$$ \le P\{U_m = 1\}\, H(X|\hat{X}_m, U_m = 1) + P\{U_m = 0\}\, H(X) + H_2\big(P\{U_m = 0\}\big). \tag{278} $$

The conditional probability of error $P\{X \ne \hat{X}_m \mid U_m = 1\}$
vanishes as $\mathsf{snr} \to \infty$ and so does $H(X|\hat{X}_m, U_m = 1)$ by
Fano's inequality. Therefore, for every $m$,

$$ \lim_{\mathsf{snr}\to\infty} H(X|Y) \le P\{U_m = 0\}\, H(X) + H_2\big(P\{U_m = 0\}\big). \tag{279} $$

The limit in (279) must be 0 since $P\{U_m = 0\} \to 0$ as
$m \to \infty$. Thus (272) is also proved in this case.

In case $H(X) = \infty$, $H(X|U_m = 1) \to \infty$ as $m \to \infty$. For
every $m$, the mutual information (expressed in the form of a
divergence) converges:

$$ \lim_{\mathsf{snr}\to\infty} D\big( P_{Y|X,U_m=1} \,\|\, P_{Y|U_m=1} \,\big|\, P_{X|U_m=1} \big) = H(X|U_m = 1). \tag{280} $$

Therefore, the mutual information increases without bound as
$\mathsf{snr} \to \infty$ by also noticing

$$ I(X;Y) \ge I(X;Y|U_m) \tag{281} $$

$$ \ge P\{U_m = 1\}\, D\big( P_{Y|X,U_m=1} \,\|\, P_{Y|U_m=1} \,\big|\, P_{X|U_m=1} \big). \tag{282} $$

We have thus proved (175) in all cases. ■